CSE303
Lecture 3: Exploratory Data Analysis
ATTRIBUTES
• Data points or Samples are described by attributes.
• Attribute (also called dimension, feature, or variable): a data field representing a
characteristic or feature of a data object.
• Types
• Nominal or Categorical: string values with no order
• Ordinal: string values with an implicit order
• Binary: only two states, e.g., 0/1, T/F, positive/negative
• Numerical: numbers (integer or real-valued)
ATTRIBUTE TYPES
• Nominal: categories, states, or “names of things”
• Hair color = {auburn, black, blond, brown, grey, red, white}
• marital status, occupation, ID numbers, zip codes
• Ordinal: Values have a meaningful order (ranking) but magnitude between
successive values is not known.
• Size = {small, medium, large}, grades, army rankings
• Binary: Nominal attribute with only 2 states (0 and 1)
• Symmetric binary: both outcomes equally important, e.g., gender
• Asymmetric binary: outcomes not equally important. e.g., medical test (positive vs.
negative)
• Numeric: represents quantity (integer or real-valued)
• Temperature, length, counts, grade point, CGPA, salary etc.
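The difference between nominal and ordinal attributes can be sketched in pandas, which stores both as categorical data but lets an ordinal attribute carry an explicit order. This is only an illustration: the example values and the variable names (`hair`, `size`, `size_type`) are assumptions, not part of the lecture data.

```python
import pandas as pd

# Nominal: categories with no order between them.
hair = pd.Series(["black", "brown", "red", "black"], dtype="category")

# Ordinal: categories with an explicit order (small < medium < large).
size_type = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
size = pd.Series(["medium", "small", "large", "small"], dtype=size_type)

print(hair.cat.ordered)   # nominal data has no order
print(size.cat.ordered)   # ordinal data does
print(size.min(), size.max())

# Order comparisons are meaningful only for the ordinal attribute.
larger_than_small = (size > "small").tolist()
print(larger_than_small)
```

Note that the order says nothing about the distance between values: "large" ranks above "medium", but not by a known amount.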
DISCRETE VS. CONTINUOUS ATTRIBUTES
• Discrete Attribute: has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a collection of documents
• Sometimes, represented as integer variables
• Note: Binary attributes are a special case of discrete attributes
• Continuous Attribute: has real numbers as attribute values
• E.g., temperature, height, or weight
• Continuous attributes are typically represented as floating-point variables
A SAMPLE DATASET
outlook temperature humidity windy play
sunny 85 85 FALSE no
sunny 80 90 TRUE no
overcast 83 86 FALSE yes
rainy 70 96 FALSE yes
rainy 68 80 FALSE yes
rainy 65 70 TRUE no
overcast 64 65 TRUE yes
sunny 72 95 FALSE no
sunny 69 70 FALSE yes
rainy 75 80 FALSE yes
sunny 75 70 TRUE yes
overcast 72 90 TRUE yes
overcast 81 75 FALSE yes
rainy 71 91 TRUE no
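As a sketch, the sample dataset above can be loaded into a pandas DataFrame to see how each attribute type is stored. The variable name `weather` is an assumption made for this example; the values are copied from the table.

```python
import pandas as pd

weather = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
                "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"],
    "temperature": [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71],
    "humidity": [85, 90, 86, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 91],
    "windy": [False, True, False, False, False, True, True,
              False, False, False, True, True, False, True],
    "play": ["no", "no", "yes", "yes", "yes", "no", "yes",
             "no", "yes", "yes", "yes", "yes", "yes", "no"],
})

print(weather.shape)    # (rows, columns)
print(weather.dtypes)   # numeric columns are integers, windy is boolean, the rest are strings
print(weather["play"].value_counts())
```

Here `outlook` is nominal, `temperature` and `humidity` are numeric, and `windy` and `play` are binary.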
EXPLORING DATA DISTRIBUTION
• Several summaries and plots help explore the distribution of a numeric
attribute.
• Boxplot: a plot of the five-number summary (min, Q1, median, Q3, max).
• Frequency Table: a tally of the count of numeric data values that fall into a
set of intervals (bins).
• Histogram: a plot of the frequency table with the bins on the x-axis and the count
(or proportion) on the y-axis.
• Density Plot: a smoothed version of the histogram.
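A frequency table can be sketched with `pd.cut`, which assigns each value to a bin. The example uses the temperature column of the sample dataset; the bin edges (width 5, from 60 to 90) are an assumption chosen for illustration.

```python
import pandas as pd

temperature = pd.Series([85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71])

# Left-closed bins [60, 65), [65, 70), ..., [85, 90).
edges = [60, 65, 70, 75, 80, 85, 90]
binned = pd.cut(temperature, bins=edges, right=False)

# Count how many values fall into each bin, listed in bin order.
freq_table = binned.value_counts().sort_index()
print(freq_table)
```

Plotting these counts as bars over the bins gives the histogram.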
EXAMPLE: HISTOGRAM
ANOTHER EXAMPLE OF HISTOGRAM
Runs Scored in First Innings    Frequency
80-100                          3
100-120                         4
120-140                         7
140-160                         12
160-180                         3
180-200                         3
200-220                         2
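The frequency table above can be turned into a rough text histogram with proportions. This is a minimal sketch in plain Python; the names `bins` and `freq` are assumptions for the example.

```python
bins = ["80-100", "100-120", "120-140", "140-160", "160-180", "180-200", "200-220"]
freq = [3, 4, 7, 12, 3, 3, 2]

total = sum(freq)  # number of innings in the table

# One row per bin: a bar of '#' marks plus the proportion of innings.
for label, count in zip(bins, freq):
    print(f"{label:>8} {'#' * count:<12} {count / total:.2f}")

# The modal bin is the one with the highest frequency.
modal_bin = bins[freq.index(max(freq))]
print("Most common range:", modal_bin)
```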
NORMAL OR GAUSSIAN DISTRIBUTION
DEFINITION OF BOXPLOT
• A boxplot draws the five-number summary:
• Min, Lower Quartile (Q1), Median, Upper Quartile (Q3), Max
• A boxplot is an effective way to find outliers.
• IQR (interquartile range) = Q3 – Q1
• Upper Extreme = Q3 + 1.5 × IQR
• Lower Extreme = Q1 – 1.5 × IQR
• Any data point that lies outside the boundary set by the Lower and Upper
Extreme can be identified as an OUTLIER.
Worked example: temperature column of the sample dataset (see A SAMPLE DATASET)
Sorted values (n = 14): 64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85
• Min = 64
• Q1 = 25% × 14 = 3.5th value = 68.5
• Median = 72
• Q3 = 75% × 14 = 10.5th value = 77.5
• Max = 85
• IQR = Q3 – Q1 = 9
• Upper Extreme = Q3 + 1.5 × IQR = 77.5 + 1.5 × 9 = 91
• Lower Extreme = Q1 – 1.5 × IQR = 68.5 – 1.5 × 9 = 55
• Any value greater than the Upper Extreme or smaller than the Lower Extreme is called an OUTLIER. Here every temperature lies between 55 and 91, so there are no outliers.
• Upper Whisker: extends to Max or the Upper Extreme, whichever is lower.
• Lower Whisker: extends to Min or the Lower Extreme, whichever is higher.
• Each data point greater than the Upper Extreme or smaller than the Lower Extreme is drawn as a dot in the boxplot.
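The hand calculation for the temperature column can be checked with a short sketch. It follows the position rule used on the slide (the quartile is the value at position p × n in the sorted data, interpolating between neighbours). The helper names `quantile_at` and `box_stats` are assumptions for this example, and library functions such as `numpy.percentile` use a different default rule, so their quartiles can differ slightly.

```python
import math

def quantile_at(sorted_vals, p):
    """Value at 1-based position p * n, interpolating between neighbours."""
    pos = p * len(sorted_vals)
    lo = math.floor(pos)
    frac = pos - lo
    if lo < 1:
        return sorted_vals[0]
    if lo >= len(sorted_vals):
        return sorted_vals[-1]
    return sorted_vals[lo - 1] + frac * (sorted_vals[lo] - sorted_vals[lo - 1])

def box_stats(values):
    """Quartiles, IQR, extremes (fences) and outliers for a list of numbers."""
    s = sorted(values)
    q1, q3 = quantile_at(s, 0.25), quantile_at(s, 0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in s if v < lower or v > upper]
    return {"min": s[0], "q1": q1, "median": quantile_at(s, 0.5), "q3": q3,
            "max": s[-1], "iqr": iqr, "lower": lower, "upper": upper,
            "outliers": outliers}

temps = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]
stats = box_stats(temps)
print(stats)

# Adding one extreme reading (a made-up value) produces an outlier.
with_extreme = box_stats(temps + [120])
print(with_extreme["upper"], with_extreme["outliers"])
```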
EXAMPLE: BOXPLOT
A boxplot can also compare the distribution of one feature (Price) across the
values of another feature (HouseType).
HouseType Price
H 200000
H 150000
U 75000
U 54000
T 100000
T 90000
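The numbers behind such a grouped boxplot can be sketched with a pandas `groupby`. The rows are the six shown above; the variable names `houses` and `summary` are assumptions for this example.

```python
import pandas as pd

houses = pd.DataFrame({
    "HouseType": ["H", "H", "U", "U", "T", "T"],
    "Price": [200000, 150000, 75000, 54000, 100000, 90000],
})

# One row of summary statistics per house type.
summary = houses.groupby("HouseType")["Price"].agg(["min", "median", "max"])
print(summary)
```

With matplotlib installed, `houses.boxplot(column="Price", by="HouseType")` draws one box per house type from the same data.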
Another Example: Comparing the data distribution of First Innings Score based on the different Venues.
EXPLORING NOMINAL AND BINARY DATA
• For nominal (categorical) data, simple proportions or percentages give
useful insight.
• Mode: the most commonly occurring category or value in a data set.
• Expected value: when the categories can be associated with numeric values, this
is the average value weighted by each category’s probability of occurrence.
• Bar charts: the frequency or proportion of each category plotted as bars.
• Pie charts: the frequency or proportion of each category plotted as wedges in a
pie.
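These summaries can be sketched on the `play` column of the sample dataset. The payoff values used for the expected value (10 for "yes", 0 for "no") are hypothetical numbers invented for this illustration.

```python
import pandas as pd

play = pd.Series(["no", "no", "yes", "yes", "yes", "no", "yes",
                  "no", "yes", "yes", "yes", "yes", "yes", "no"])

counts = play.value_counts()                      # frequency of each category
proportions = play.value_counts(normalize=True)   # proportion of each category
mode = play.mode()[0]                             # most common category

# Expected value: numeric value of each category weighted by its proportion.
payoff = {"yes": 10.0, "no": 0.0}  # hypothetical values, not from the lecture
expected = sum(payoff[category] * p for category, p in proportions.items())

print(counts)
print(proportions)
print("mode:", mode)
print("expected value:", expected)
```

The `counts` series is what a bar chart or pie chart of this column would display.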
EXAMPLE: BAR CHART
EXAMPLE: PIE CHART
EXPLORING TWO OR MORE VARIABLES
• Contingency Tables: A tally of counts between two or more categorical
variables
• Scatterplots: show the relationship between two numeric variables. They become
hard to read when there are many data points.
• Hexagonal binning: A plot of two numeric variables with the records binned
into hexagons
• Boxplots: A simple way to visually compare the distributions of a numeric
variable grouped according to a categorical variable.
CONTINGENCY TABLE
Raw data (excerpt):
Gender    Designation
Male      Professor
Female    Assoc. Professor
Male      Asst. Professor
Contingency Table: Gender-wise Breakdown of Different Posts
Designation         Male    Female
Professor           3       0
Assoc. Professor    4       1
Asst. Professor     3       0
Sr. Lecturer        4       3
Lecturer            2       2
              Breathing Problem
Smoker?       Yes    No
Non-smoker    1      10
Smoker        4      3
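A contingency table can be built from raw records with `pd.crosstab`. As a sketch, the code below recreates one row per person from the smoker counts above and tallies them again; the column names `smoker` and `breathing_problem` are assumptions for this example.

```python
import pandas as pd

# One record per person, reconstructed from the counts in the table.
records = (
    [("Non-smoker", "Yes")] * 1 + [("Non-smoker", "No")] * 10
    + [("Smoker", "Yes")] * 4 + [("Smoker", "No")] * 3
)
people = pd.DataFrame(records, columns=["smoker", "breathing_problem"])

# Tally of counts for every combination of the two categorical variables.
table = pd.crosstab(people["smoker"], people["breathing_problem"])
print(table)

# Row-wise proportions make the two groups easier to compare.
row_share = pd.crosstab(people["smoker"], people["breathing_problem"], normalize="index")
print(row_share)
```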
EXAMPLE: SCATTER PLOT
PROBLEM OF SCATTER PLOT
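• A scatter plot of 50 points is easy to read, but with 10,000 points the markers overlap and hide the structure:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'x_axis': np.random.rand(50), 'y_axis': np.random.rand(50)})        # 50 points
df2 = pd.DataFrame({'x_axis': np.random.rand(10000), 'y_axis': np.random.rand(10000)})  # 10,000 points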
EXAMPLE: HEXAGONAL BINNING
A hexagonal binning plot is useful for a large dataset. It divides the chart area
into hexagonal bins and assigns each bin a color intensity based on the number of
points that fall in it.
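A hexagonal binning plot of 10,000 random points can be sketched with matplotlib's `hexbin`. The grid size, color map, and random seed are assumptions chosen for illustration, and the example assumes matplotlib is installed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(10_000)
y = rng.random(10_000)

fig, ax = plt.subplots()
# Each hexagon is colored by the number of points that fall inside it.
hb = ax.hexbin(x, y, gridsize=25, cmap="Blues")
fig.colorbar(hb, ax=ax, label="points per bin")
ax.set_xlabel("x_axis")
ax.set_ylabel("y_axis")

counts = hb.get_array()  # per-hexagon counts behind the colors
print(len(counts), counts.max())
```

Calling `fig.savefig("hexbin.png")` writes the figure to a file; in a notebook the figure is shown directly.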
INTRODUCING PANDAS DATAFRAME
• A pandas DataFrame is a two-dimensional, size-mutable, potentially
heterogeneous tabular data structure with labeled axes (rows and columns).
• Data in a DataFrame is aligned in a tabular fashion in rows and columns. A
DataFrame consists of three principal components: the data, the rows, and the
columns.
EXAMPLE: PANDAS DATAFRAME
CREATING PANDAS DATAFRAME USING A DICTIONARY
import pandas as pd
dict1 = {'id':[1,2,3],'name':['alice','bob','charlie'],'age':[20, 25, 32]}
df1 = pd.DataFrame(dict1)
print(df1)
CREATING PANDAS DATAFRAME USING A CSV FILE
import pandas as pd
df = pd.read_csv('../sample_data_1.csv', header=None)  # the file has no header row
df.columns = ['id', 'state', 'population', 'murder_rate']  # name the columns
print(df)
df.head()   # returns the first 5 rows
df.tail()   # returns the last 5 rows
df.count()  # returns the number of non-null values in each column
LIST OF FUNCTIONS ON DATAFRAME
Function Description
count() number of non-null observations
sum() sum of values
mean() mean of values
median() median of values
mode() mode of values
std() standard deviation of values
var() variance of values
quantile() quantile of values
min() minimum value
max() maximum value
abs() absolute value
cumsum() cumulative sum
cumprod() cumulative product
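A few of the functions listed above can be tried on a small DataFrame. This sketch uses the first five temperature and humidity readings of the sample dataset; the variable name `df` is an assumption for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "temperature": [85, 80, 83, 70, 68],
    "humidity": [85, 90, 86, 96, 80],
})

print(df.count())         # non-null observations per column
print(df.mean())          # mean of each column
print(df.median())        # median of each column
print(df.std())           # standard deviation of each column
print(df.quantile(0.25))  # first quartile of each column

temp = df["temperature"]
print(temp.sum(), temp.min(), temp.max())
print(temp.cumsum().tolist())  # running total
```

Most of these functions work both on a whole DataFrame (one result per column) and on a single column.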
USEFUL RESOURCES
• Chapter 1, Practical Statistics for Data Scientists by Bruce and Bruce
• https://pandas.pydata.org/pandas-docs/stable/reference/index.html
• https://etav.github.io/python/count_basic_freq_plot.html