Data Science
DAI-101 Spring 2024-25
Dr. Devesh Bhimsaria
Office: F9, Old Building
Department of Biosciences and Bioengineering
Indian Institute of Technology–Roorkee
[email protected]
Python with Data Analysis
Data Cleaning: Outliers
⚫ Python program 1 for data cleaning
⚫ Interquartile Range (IQR) is a statistical measure that describes the spread of
the middle 50% of data points in a dataset. It is used to detect variability and
identify outliers in the data. The IQR is calculated as the difference between
the third quartile (Q3) and the first quartile (Q1):
IQR = Q3 − Q1
⚫ Q1 (First Quartile): The 25th percentile; 25% of the data is smaller than this
value.
⚫ Q3 (Third Quartile): The 75th percentile; 75% of the data is smaller than this
value.
⚫ Median: The 50th percentile of the data, separating it into two halves.
Outlier Thresholds:
⚫ Lower Bound: Q1 − 1.5 × IQR
⚫ Upper Bound: Q3 + 1.5 × IQR
⚫ Values outside these thresholds are considered outliers.
Data Cleaning: Outliers
Before removal (ID: Age):
1: 22, 2: 25, 3: 30, 4: 24, 5: 29, 6: 35, 7: 120, 8: 28, 9: 32, 10: 31,
11: 27, 12: 26, 13: 23, 14: 40, 15: 33, 16: 29, 17: 36, 18: 100, 19: 28, 20: 29

After removal (outlier rows ID 7 = 120 and ID 18 = 100 dropped):
1: 22, 2: 25, 3: 30, 4: 24, 5: 29, 6: 35, 8: 28, 9: 32, 10: 31,
11: 27, 12: 26, 13: 23, 14: 40, 15: 33, 16: 29, 17: 36, 19: 28, 20: 29
Lower Bound: 16.625
Upper Bound: 43.625
Installing libraries
⚫ On Terminal
⚫ pip install pandas matplotlib (general)
⚫ pip3 install pandas matplotlib (Python 3)
⚫ After installation, import them in your code:
import pandas as pd
import matplotlib.pyplot as plt
Data Cleaning: Outliers
⚫ Python code
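The code on this slide was shown as a screenshot; a minimal pandas sketch of the IQR filter described above, reproducing the bounds from the previous slide (column and variable names are illustrative):

```python
import pandas as pd

# Data from the previous slide: IDs 1-20 with two injected outliers (120 and 100)
df = pd.DataFrame({
    "ID": range(1, 21),
    "Age": [22, 25, 30, 24, 29, 35, 120, 28, 32, 31,
            27, 26, 23, 40, 33, 29, 36, 100, 28, 29],
})

# IQR-based thresholds
q1 = df["Age"].quantile(0.25)
q3 = df["Age"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
print("Lower Bound:", lower)   # 16.625
print("Upper Bound:", upper)   # 43.625

# Keep only rows whose Age lies inside the thresholds
cleaned = df[(df["Age"] >= lower) & (df["Age"] <= upper)]
print(cleaned)                 # 18 rows; ages 120 and 100 are gone
```

The two dropped rows match the right-hand table on the previous slide.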
Data Reduction: PCA
⚫ Principal Component Analysis (PCA) is fundamentally based on the
mathematics of eigenvalues and eigenvectors.
⚫ Step 1: Compute the covariance matrix. It captures the variance (diagonal
elements) and the correlation between features. If X is the centered dataset
(mean 0), Σ is the p × p symmetric matrix:
Σ = XᵀX / (n − 1)
⚫ Step 2: Solve for the eigenvalues λ and eigenvectors v:
Σv = λv
⚫ Step 3: Order the principal components in descending order of eigenvalue
⚫ Step 4: Principal Components: The eigenvectors are the principal axes that
define the new coordinate system. The data can be projected onto these axes
to form the principal components:
Z = XV
⚫ Z: Transformed data in the reduced-dimension space.
⚫ V: Matrix of eigenvectors corresponding to the top k eigenvalues.
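The four steps above can be sketched directly with NumPy; the dataset here is a random toy matrix (n = 100 samples, p = 3 features), used only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))      # toy dataset
X = X - X.mean(axis=0)             # center so each feature has mean 0

# Step 1: covariance matrix, p x p and symmetric
cov = X.T @ X / (X.shape[0] - 1)

# Step 2: eigenvalues and eigenvectors (eigh is for symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 3: sort in descending order of eigenvalue
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: project onto the top k principal axes, Z = XV
k = 2
Z = X @ eigvecs[:, :k]
print(Z.shape)                     # (100, 2)
```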
Data Reduction: PCA
⚫ Explained Variance Ratio
⚫ The explained variance ratio tells you how much of the total variance in the
original data is captured by each principal component (PC).
⚫ Variance measures how much the data spreads out (varies) along a particular
dimension. PCA tries to find new axes (principal components) that
maximize the variance in the data.
⚫ Math: let
⚫ Total Variance = Σᵢ λᵢ (the sum of all eigenvalues of the covariance matrix),
⚫ λᵢ: the eigenvalue corresponding to the i-th principal component.
⚫ The explained variance ratio for the i-th component is:
Explained Variance Ratioᵢ = λᵢ / Total Variance
⚫ This represents the proportion of the dataset’s total variance explained by
the i-th component.
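The ratio follows directly from the eigenvalues; a tiny NumPy sketch with made-up eigenvalues (4, 1.5, 0.5 are assumed values, not from any real dataset):

```python
import numpy as np

eigvals = np.array([4.0, 1.5, 0.5])   # toy eigenvalues of a covariance matrix
ratio = eigvals / eigvals.sum()       # explained variance ratio per component
print(ratio)                          # [0.6667, 0.25, 0.0833]
print(np.cumsum(ratio))               # cumulative variance retained
```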
Data Reduction: Cluster & Sample
⚫ Python code
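The original code was shown as a screenshot; a sketch of one common cluster-then-sample scheme, assuming scikit-learn's KMeans for clustering and pandas' per-group sampling (blob locations and fractions are my own choices):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy 2-D dataset: three well-separated blobs of 100 points each
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])
df = pd.DataFrame(X, columns=["x1", "x2"])

# Step 1: cluster the data
km = KMeans(n_clusters=3, n_init=10, random_state=0)
df["cluster"] = km.fit_predict(df[["x1", "x2"]])

# Step 2: keep a fixed fraction from each cluster as the reduced dataset
sample = df.groupby("cluster").sample(frac=0.1, random_state=0)
print(sample.shape)   # (30, 3): 10 points per cluster instead of 100
```

Sampling within clusters keeps every region of the data represented, unlike naive random sampling, which can miss small clusters.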
Wavelet transform example
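The worked example was shown as a figure; a minimal NumPy sketch of one level of the Haar transform (the simplest wavelet), with an assumed toy signal, illustrating how keeping only the approximation coefficients halves the data:

```python
import numpy as np

def haar_dwt(signal):
    """One level of the Haar discrete wavelet transform.

    Returns (approximation, detail): scaled pairwise sums and
    differences of neighbouring samples.
    """
    s = np.asarray(signal, dtype=float)
    approx = (s[0::2] + s[1::2]) / np.sqrt(2)   # low-pass: local averages
    detail = (s[0::2] - s[1::2]) / np.sqrt(2)   # high-pass: local differences
    return approx, detail

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
a, d = haar_dwt(x)
print(a)   # coarse trend (4 values for 8 inputs)
print(d)   # fluctuations; small details can be discarded for reduction
```

The transform is orthonormal, so the total energy of `a` and `d` equals that of `x`; dropping near-zero detail coefficients loses little information.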
Data Reduction: Linear regression
Data Reduction: PCA
⚫ Python code
⚫ Original Dataset Shape: (150, 4)
⚫ 2 PCA components:
⚫ Reduced Dataset Shape: (150, 2)
⚫ Explained Variance Ratio: [0.72962445 0.22850762]
⚫ Total Variance Retained: 0.9581320720000164
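A sketch that reproduces the numbers above with scikit-learn, assuming the Iris dataset standardized with StandardScaler before PCA (the printed ratios match standardized data; without scaling they would differ):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data
print("Original Dataset Shape:", X.shape)          # (150, 4)

# Standardize first so each feature contributes equally to the variance
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)
print("Reduced Dataset Shape:", X_reduced.shape)   # (150, 2)
print("Explained Variance Ratio:", pca.explained_variance_ratio_)
print("Total Variance Retained:", pca.explained_variance_ratio_.sum())
```

Changing `n_components=2` to `3` yields the three-component results on the next slide.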
Data Reduction: PCA
⚫ Python code
⚫ 3 PCA components:
⚫ Reduced Dataset Shape: (150, 3)
⚫ Explained Variance Ratio: [0.72962445 0.22850762 0.03668922]
⚫ Cumulative Variance Retained: [0.72962445 0.95813207 0.99482129]
Data Reduction: Linear regression
⚫ Python code
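The original code was shown as a screenshot; a sketch of parametric data reduction with scikit-learn's LinearRegression, on assumed toy data (true slope 3, intercept 5, plus noise): the 200 data points are reduced to just two fitted parameters.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y depends linearly on x, plus Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200).reshape(-1, 1)
y = 3.0 * x.ravel() + 5.0 + rng.normal(scale=0.5, size=200)

# Fit y = w*x + b; (w, b) now stand in for the whole dataset
model = LinearRegression().fit(x, y)
print("slope:", model.coef_[0])          # close to 3.0
print("intercept:", model.intercept_)    # close to 5.0
print("R^2:", model.score(x, y))         # how much variance the model explains
```

Storing only the model parameters (plus residual statistics, if needed) is a lossy but compact summary of the data.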
Thank You
• All my slides/notes, excluding third-party material, are licensed by various authors including myself under https://creativecommons.org/licenses/by-nc/4.0/