Data Viz
Statistics Analytics
Graphical
EDA
Session #3
“The greatest value of a picture is when it
forces us to notice what we never expected
to see.”
John Tukey
CSV
Spreadsheet
A Quick Refresher On Statistics
Statistics
• Descriptive Statistics
  • Measures of Central Tendency: Mean, Median, Mode
  • Measures of Dispersion: Range, Standard Deviation, Variance, IQR
  • Measures of Distribution: Skewness, Kurtosis
• Inferential Statistics
  • Hypothesis Testing
  • Regression Analysis
56, 1, 1, 7, 10
Mean = (56 + 1 + 1 + 7 + 10) / 5 = 15
[Bar chart of the five values, with the mean of 15 marked]
56, 1, 1, 7, 10
Median: sorted values are 1, 1, 7, 10, 56; the middle value is 7
56, 1, 1, 7, 10
Mode
Frequency of each value: 56 – 1, 1 – 2, 7 – 1, 10 – 1
Mode = 1
56, 1, 1, 7, 10
Mean 15
Median 7
Mode 1
Which is the best measure of central tendency?
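The three measures for the example data can be checked with Python's standard statistics module. This is a minimal sketch; the variable names are illustrative and not from the slides.

```python
import statistics

data = [56, 1, 1, 7, 10]

mean = statistics.mean(data)      # (56 + 1 + 1 + 7 + 10) / 5
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)

# Dropping the extreme value 56 moves the mean far more than the median.
trimmed = [x for x in data if x != 56]
print(statistics.mean(trimmed), statistics.median(trimmed))
```

Because the single large value (56) pulls the mean up to 15, the median is often the more robust summary for data like this, although the best choice depends on the question being asked.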
56, 1, 1, 7, 10
Range = 56 - 1 = 55
56, 1, 1, 7, 10
Variance
[Bar chart of the five values with the mean of 15 marked; the deviation of each value from the mean is squared]
Squared deviations from the mean: (56 - 15)² = 1681, (1 - 15)² = 196, (1 - 15)² = 196, (7 - 15)² = 64, (10 - 15)² = 25
Variance = mean of the squared deviations = (1681 + 196 + 196 + 64 + 25) / 5 = 432.4
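The variance arithmetic for 56, 1, 1, 7, 10 can be reproduced step by step. The sketch below assumes the population form (divide by n), which is what gives 432.4; the sample form (divide by n - 1) is shown only for comparison.

```python
import statistics

data = [56, 1, 1, 7, 10]
mean = sum(data) / len(data)  # 15.0

# Square each value's distance from the mean, then average the squares.
squared_deviations = [(x - mean) ** 2 for x in data]
population_variance = sum(squared_deviations) / len(data)

print(squared_deviations)   # [1681.0, 196.0, 196.0, 64.0, 25.0]
print(population_variance)  # 432.4

# The standard library agrees with the population form (divide by n);
# the sample form divides by n - 1 instead and gives a larger value.
print(statistics.pvariance(data))
print(statistics.variance(data))
```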
56, 1, 1, 7, 10
Standard Deviation = √432.4 ≈ 20.8
• Measure of how dispersed the data are in relation to the mean.
• A small standard deviation indicates the data are clustered tightly around the mean.
• A large standard deviation indicates the data are more spread out.
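A short sketch of the standard deviation as the square root of the population variance, followed by two hypothetical series (not from the slides) that share a mean of 15 but differ in spread.

```python
import math
import statistics

data = [56, 1, 1, 7, 10]

# Population standard deviation: square root of the population variance.
std_dev = math.sqrt(statistics.pvariance(data))
print(round(std_dev, 1))  # 20.8

# Two hypothetical series with the same mean (15) but different spread.
tight = [14, 15, 15, 15, 16]
spread = [1, 5, 15, 25, 29]
print(statistics.pstdev(tight), statistics.pstdev(spread))
```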
Standard
Deviation
IQR (Interquartile Range)
Represents the middle 50% of the data
[Figure: IQR illustration with the median marked]
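The IQR for the running example can be sketched with statistics.quantiles. The choice of method="inclusive" is an assumption made here; quartile conventions differ, and on a sample this small they give noticeably different results.

```python
import statistics

data = [56, 1, 1, 7, 10]

# quantiles() with n=4 returns the three quartile cut points.
# method="inclusive" treats the data as the whole population; other
# conventions give different quartiles on a sample this small.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

Here the IQR is Q3 minus Q1, the width of the band that holds the middle half of the sorted values.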
Box & Whisker Plot (Box Plot)
[Figure: box plot with the median marked]
Skewness
Measure of the asymmetry of the distribution of a variable about its mean
Right-Skewed Data:
• In business, right-skewed data could indicate a small number of high-value customers or transactions.
• Strategies might focus on retaining these high-value entities while trying to shift more customers towards higher value.
Left-Skewed Data:
• If the distribution of purchase amounts is left-skewed, it suggests that most customers are making relatively high-value purchases.
• This can indicate a strong, high-value customer base.
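Skewness can be computed directly from its moment definition. The sketch below uses the population form (third central moment divided by the 1.5 power of the second); the purchase amounts are hypothetical and chosen only to show the sign.

```python
def skewness(values):
    """Moment-based (population) skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Hypothetical purchase amounts, not taken from the slides.
right_skewed = [10, 12, 11, 13, 12, 95]  # one very large purchase
left_skewed = [90, 88, 91, 89, 90, 5]    # one very small purchase

print(skewness(right_skewed))  # positive: long tail to the right
print(skewness(left_skewed))   # negative: long tail to the left
```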
Kurtosis
• Measure of the "tailedness" of the distribution of a variable
• Shape of a distribution's tails in relation to its overall shape
• Business reading: spiky sales (heavy tails, occasional extreme values) versus stable sales (light tails)
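A minimal sketch of excess kurtosis from its moment definition, comparing a stable series with a spiky one. Both sales series are hypothetical; subtracting 3 so that a normal distribution scores 0 is a common convention, assumed here.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2 ** 2 - 3 (0 for a normal curve)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3

# Hypothetical daily sales figures, not taken from the slides.
stable_sales = [100, 102, 98, 101, 99, 100, 103, 97]
spiky_sales = [100, 100, 100, 100, 100, 100, 100, 500]

print(excess_kurtosis(stable_sales))  # negative: light tails
print(excess_kurtosis(spiky_sales))   # positive: heavy tail from the spike
```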
Correlation
Statistical measure that describes the strength and direction
of a relationship between two variables
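One common way to put a number on strength and direction is the Pearson correlation coefficient, sketched below in plain Python. The slides do not name a specific coefficient, so Pearson is an assumption, and the age and price values are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical used-car data: price tends to fall as age rises.
age_years = [1, 2, 3, 4, 5]
price = [9.0, 8.1, 6.9, 6.2, 5.0]

r = pearson_r(age_years, price)
print(round(r, 3))
```

The sign of r gives the direction (negative here, since price falls with age) and its magnitude, between 0 and 1, gives the strength of the linear relationship.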
Exploratory Data Analysis (EDA)
Crucial process of performing initial investigations on data to discover patterns and to check assumptions with the help of summary statistics and graphical representations
Exploratory Data Analysis
• Meta Data
• Uni-Variate Analysis
• Multivariate Analysis
• Detect Missing Values
• Detect Outliers
• Feature Engineering
• Business Insights
Analytics
[Chart axes: Value vs. Complexity]
• Descriptive: What happened
• Diagnostic: Why something happened
• Predictive: What will happen
• Prescriptive: What should be done
Analysis Vs Analytics
• Analysis: Past and Present Focus; Looks at Historical Data
• Analytics: Future Oriented; Anticipates Trends
Univariate Analysis
• Explores each variable on its own in a data set
• Descriptive statistics describe and summarize data.
• Central tendency of the values.
• Check for the variability in the data
• Check the shape of data
• Check for missing values
• Check for outliers
• Check Skewness and Kurtosis
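The univariate checks can be gathered into one small summary per variable. This is a sketch only: the profile function is a hypothetical helper, and using None to mark a missing value is an assumption.

```python
import statistics

def profile(column):
    """Summarise one variable; None marks a missing value (an assumption)."""
    present = [x for x in column if x is not None]
    return {
        "count": len(present),
        "missing": len(column) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "std_dev": statistics.pstdev(present),
        "min": min(present),
        "max": max(present),
    }

# The example values 56, 1, 1, 7, 10 plus one hypothetical missing entry.
summary = profile([56, 1, None, 1, 7, 10])
for name, value in summary.items():
    print(f"{name}: {value}")
```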
Multivariate Analysis
• Involves examining relationships between two or more variables simultaneously
• Correlation Analysis
• Inferential Analysis
• Interaction Analysis
• Visualize relationships using: Scatterplots, Heatmaps, Pair-Plots etc.
• Checking Missing values and Outliers
Graphical/Visual Analysis Of Statistical Measures
✓ What are the central tendencies?
  ✓ Mean
  ✓ Median
  ✓ Mode(s)
✓ What are the dispersion measures?
  ✓ Range
  ✓ IQR
  ✓ Standard Deviation
  ✓ Variance
✓ What are the shapes of the distributions?
  ✓ Uniform
  ✓ Symmetric or skewed?
• Are there any missing values? How do you treat them?
• Are there outliers or extreme values? How do you treat them?
Metadata
Filename Catpics#1.jpg
Owner Bella
Created 1st May 2024
Camera ………
……..
Missing
Data
A Few Reasons for Missing Values
• Past data may have been corrupted because of improper maintenance.
• Observations were not recorded for certain fields.
• Values were not recorded correctly because of human error.
• The user intentionally did not provide the values.
• Item nonresponse: the participant declined to answer a particular item.
Missing Values Treatment
• Deletion
• Imputation
Impute Missing Values
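The two treatments can be contrasted on a tiny example. The records below are hypothetical, None is assumed to mark a missing value, and filling with the median is just one of several possible imputation choices.

```python
import statistics

# Hypothetical records; None marks a missing price (an assumption).
rows = [
    {"car": "A", "price": 4.5},
    {"car": "B", "price": None},
    {"car": "C", "price": 6.0},
    {"car": "D", "price": 5.1},
]

# Deletion: drop every row whose price is missing.
deleted = [row for row in rows if row["price"] is not None]

# Imputation: fill the gap with the median of the observed prices.
fill_value = statistics.median(row["price"] for row in deleted)
imputed = [
    {**row, "price": fill_value if row["price"] is None else row["price"]}
    for row in rows
]

print(len(deleted), imputed[1]["price"])
```

Deletion keeps only fully observed rows at the cost of losing data, while imputation keeps every row but introduces an estimated value.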
Outlier Analysis
Detection and Treatment
Outlier
What are Anomalies?
• They are abnormal observations that lie far away from other values.
• They are easy to identify when the observations are just a handful of numbers in one dimension.
• But when there are thousands of observations or multiple dimensions, we need cleverer ways to detect those values.
Generally, an anomaly is an outcome or value that deviates from what is expected, but the exact criteria for what determines an anomaly can vary from situation to situation.
Most Common Causes Of Outliers In A Dataset
• Errors
• Natural
• Intentional
[Chart: iPhone Sales - Website on 1st April. X-axis: TIME, 2:00 AM to 12:00 PM the following day in 2-hour steps. Y-axis: # OF UNITS SOLD, 0 to 1000.]
Applications of Anomaly Detection
Fraud Detection In Credit Card Transactions
• Credit card fraud is a socially relevant problem and poses a great threat to businesses all around the world.
• Outlier detection algorithms are applied to detect fraudulent transactions.
[Figure labels: ₹ 10,000, ₹ 1,50,000, Anomaly]
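As a toy illustration of flagging an unusual transaction amount, the sketch below uses the modified z-score, which is built on the median and the median absolute deviation (MAD). This is one simple rule chosen for illustration, not the method real fraud systems use; the amounts, the function name and the 3.5 cut-off are assumptions.

```python
import statistics

def flag_anomalies(amounts, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds threshold."""
    median = statistics.median(amounts)
    mad = statistics.median(abs(a - median) for a in amounts)
    if mad == 0:
        return []  # no spread to measure against
    return [a for a in amounts if 0.6745 * abs(a - median) / mad > threshold]

# Hypothetical card transactions in rupees; one is far larger than the rest.
transactions = [9500, 10000, 10200, 9800, 10500, 150000, 9900]

print(flag_anomalies(transactions))
```

The median and MAD are used instead of the mean and standard deviation because a single extreme amount distorts the latter two, which can hide the very value being looked for.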
Outliers
• Univariate Outlier: can be found when looking at a distribution of values in a single feature space
• Multivariate Outlier: can be found in an n-dimensional space (of n features)
• Central rectangle spans the first quartile to the third quartile (the interquartile range or IQR).
• Segment inside the rectangle shows the median
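• “Whiskers” above and below the box show the locations of the minimum and maximum, excluding outliers (a common convention limits the whiskers to 1.5 × IQR beyond the quartiles).
• Outliers are marked beyond the whiskers.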
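The box plot rule for marking outliers can be sketched as fences placed 1.5 × IQR beyond the quartiles. The 1.5 multiplier and the "inclusive" quartile method are common conventions assumed here; the slides do not specify them, and iqr_outliers is a hypothetical helper name.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

data = [56, 1, 1, 7, 10]
print(iqr_outliers(data))
```

With these conventions the value 56 falls above the upper fence, so it would be drawn as a separate point beyond the whisker.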
Google Facets
• Link: https://pair-code.github.io/facets/index.html
• Load the Dataset in LMS : “Session#3_Dataset _Used_Car.xlsx”
• Perform the following Univariate Analysis and Multivariate Analysis
Disclaimer
A few of the graphs and visualizations used in this presentation are not my own and have been taken from the following sources.
Sources: Google Images, The Economist, William S. Cleveland and Robert McGill (1984), Data-to-Viz, Oracle, Datavizcatalogue, blogs.sas.com, steema.com, Consumer Reports, The Visual Display of Quantitative Information by E. Tufte, Python Graph Gallery, Lucidchart, Storytelling with Data by Cole Nussbaumer Knaflic, and Good Charts by Scott Berinato