Lecture3 Classnotes

The document covers exploratory data analysis, focusing on attributes, their types (nominal, ordinal, binary, and numerical), and their classifications as discrete or continuous. It discusses various methods for visualizing data distributions, such as boxplots, histograms, and scatterplots, and introduces the Pandas DataFrame for data manipulation. Additionally, it provides examples of data representation and analysis techniques, including contingency tables and hexagonal binning.

Uploaded by

2023-2-60-284

CSE303

Lecture 3: Exploratory Data Analysis

ATTRIBUTES

• Data points (samples) are described by attributes.

• Attribute (also dimension, feature, or variable): a data field representing a
characteristic or feature of a data object.
• Types
• Nominal or categorical: string values
• Ordinal: string values with an implicit order
• Binary: only two states, e.g. 0/1, T/F, positive/negative
• Numerical: numbers (integer or real-valued)

ATTRIBUTE TYPES
• Nominal: categories, states, or “names of things”
• Hair color = {auburn, black, blond, brown, grey, red, white}
• marital status, occupation, ID numbers, zip codes

• Ordinal: values have a meaningful order (ranking), but the magnitude between
successive values is not known.
• Size = {small, medium, large}, grades, army rankings

• Binary: nominal attribute with only 2 states (0 and 1)

• Symmetric binary: both outcomes are equally important, e.g., gender
• Asymmetric binary: outcomes are not equally important, e.g., medical test (positive vs.
negative)

• Numeric: represents a quantity (integer or real-valued)

• Temperature, length, counts, grade point, CGPA, salary, etc.
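
The four attribute types could be represented in pandas roughly as follows. This is only a sketch: the category names come from the hair-color and size examples above, but the individual records, the variable names, and the CGPA values are made up for illustration.

```python
import pandas as pd

# Nominal: categories without any order (hair-color example)
hair = pd.Series(["black", "brown", "red", "black"], dtype="category")

# Ordinal: categories with an explicit order (size example)
size_type = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
size = pd.Series(["medium", "small", "large", "small"], dtype=size_type)

# Binary: only two states, stored here as booleans (hypothetical test results)
test_positive = pd.Series([True, False, False, True])

# Numeric: quantities, stored here as floats (hypothetical CGPA values)
cgpa = pd.Series([3.2, 3.9, 2.8, 3.5])

print(hair.cat.ordered)           # nominal: no order defined
print(size.cat.ordered)           # ordinal: order defined
print(size.min(), size.max())     # order-based operations work on ordinal data
print((size > "small").tolist())  # comparisons follow the declared order
```

Only the ordinal column supports comparisons and min/max, because only there the order of the categories is declared.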

DISCRETE VS. CONTINUOUS ATTRIBUTES

• Discrete Attribute: has only a finite or countably infinite set of values

• E.g., zip codes, profession, or the set of words in a collection of documents
• Sometimes, represented as integer variables
• Note: Binary attributes are a special case of discrete attributes

• Continuous Attribute: has real numbers as attribute values

• E.g., temperature, height, or weight
• Continuous attributes are typically represented as floating-point variables

A SAMPLE DATASET
outlook   temperature  humidity  windy  play
sunny     85           85        FALSE  no
sunny     80           90        TRUE   no
overcast  83           86        FALSE  yes
rainy     70           96        FALSE  yes
rainy     68           80        FALSE  yes
rainy     65           70        TRUE   no
overcast  64           65        TRUE   yes
sunny     72           95        FALSE  no
sunny     69           70        FALSE  yes
rainy     75           80        FALSE  yes
sunny     75           70        TRUE   yes
overcast  72           90        TRUE   yes
overcast  81           75        FALSE  yes
rainy     71           91        TRUE   no
EXPLORING DATA DISTRIBUTION

• There are many visual representation methods to explore the distribution of data.
• Boxplot: a plot of the five-number summary (min, Q1, median, Q3, max)
• Frequency Table: a tally of the count of numeric data values that fall into a
set of intervals (bins).
• Histogram: a plot of the frequency table with the bins on the x-axis and the count
(or proportion) on the y-axis.
• Density Plot: a smoothed version of the histogram.
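
A frequency table like the one described above can be sketched with pandas as follows, using the temperature column of the sample dataset. The bin edges (60, 70, 80, 90) are an assumption made for this example, not something fixed by the slides.

```python
import pandas as pd

# Temperature column of the sample dataset
temperature = pd.Series([85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71])

# Assumed bins for this sketch: [60, 70), [70, 80), [80, 90)
bins = [60, 70, 80, 90]
binned = pd.cut(temperature, bins=bins, right=False)

# Frequency table: number of values per bin, listed in bin order
freq = binned.value_counts().sort_index()

# The same table as proportions (the alternative y-axis of a histogram)
proportion = freq / freq.sum()

print(freq)
print(proportion)
```

Plotting `freq` as bars would give the histogram; the choice of bins changes the shape, so it is worth trying more than one.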

EXAMPLE: HISTOGRAM
ANOTHER EXAMPLE OF HISTOGRAM
Runs Scored in First Innings   Frequency
80-100                         3
100-120                        4
120-140                        7
140-160                        12
160-180                        3
180-200                        3
200-220                        2
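
As a rough sketch, the frequency table above can be turned into a text histogram with plain Python. The variable names are chosen for this example only; each row shows one bin, a bar whose length is the count, and the proportion of innings in that bin.

```python
# Frequency table from the slide: (bin label, count)
runs_frequency = [
    ("80-100", 3), ("100-120", 4), ("120-140", 7), ("140-160", 12),
    ("160-180", 3), ("180-200", 3), ("200-220", 2),
]

total = sum(count for _, count in runs_frequency)

# One row per bin: label, bar of '#' characters, proportion of the total
for label, count in runs_frequency:
    print(f"{label:>8} | {'#' * count:<12} {count / total:.2f}")

# The bin with the highest frequency is the tallest bar of the histogram
modal_bin = max(runs_frequency, key=lambda item: item[1])[0]
print("Most frequent bin:", modal_bin)
```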

NORMAL OR GAUSSIAN DISTRIBUTION

DEFINITION OF BOXPLOT

• A boxplot displays a 5-number summary:
• Min, Lower Quartile (Q1), Median, Upper Quartile (Q3), Max
• A boxplot is an efficient way to find outliers.
• IQR (interquartile range) = Q3 - Q1
• Upper Extreme = Q3 + 1.5 × IQR
• Lower Extreme = Q1 - 1.5 × IQR
• Any data point that is not contained within the boundary of the Lower
and Upper Extreme can be identified as an OUTLIER.

• Example: temperature column of the sample dataset (14 values), sorted:
64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85

• Min = 64
• Q1 = 25% × 14 = 3.5th value = 68.5
• Median = 72
• Q3 = 75% × 14 = 10.5th value = 77.5
• Max = 85
• IQR = Q3 - Q1 = 9
• Upper Extreme = Q3 + 1.5 × IQR = 77.5 + 1.5 × 9 = 91
• Lower Extreme = Q1 - 1.5 × IQR = 68.5 - 1.5 × 9 = 55

• Upper Whisker: extended to Max or Upper Extreme, whichever is lower.
• Lower Whisker: extended to Min or Lower Extreme, whichever is higher.
• Any value greater than the Upper Extreme or smaller than the Lower Extreme is called an OUTLIER.
• Each outlier is represented as a dot in the boxplot.
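
The calculation above can be reproduced with a short Python sketch. The helper `value_at` is a name introduced here; it follows the slide's convention of taking the (q × n)-th value and interpolating between neighbours for fractional positions. Library functions such as pandas `quantile()` use a different default interpolation and may return slightly different quartiles.

```python
import math
import statistics

# Temperature column of the sample dataset
temperature = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]
values = sorted(temperature)
n = len(values)

def value_at(position):
    """Value at a 1-indexed position; fractional positions are interpolated."""
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return values[lower - 1] + fraction * (values[upper - 1] - values[lower - 1])

q1 = value_at(0.25 * n)             # 3.5th value
median = statistics.median(values)
q3 = value_at(0.75 * n)             # 10.5th value
iqr = q3 - q1
upper_extreme = q3 + 1.5 * iqr
lower_extreme = q1 - 1.5 * iqr

# Points outside the extremes are outliers; whiskers stop at the data or the extreme
outliers = [v for v in values if v < lower_extreme or v > upper_extreme]
upper_whisker = min(max(values), upper_extreme)
lower_whisker = max(min(values), lower_extreme)

print(q1, median, q3, iqr, lower_extreme, upper_extreme, outliers)
```

For this column no value falls outside the extremes, so the whiskers end at the minimum and maximum of the data.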
EXAMPLE: BOXPLOT
A boxplot can also be used to compare the distribution of one feature (Price)
across the values of another feature (HouseType).

HouseType Price

H 200000
H 150000
U 75000
U 54000
T 100000
T 90000
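
A grouped comparison like this might be sketched in pandas as below. The DataFrame name `houses` is chosen for this example; the per-group summary shows the kind of numbers a grouped boxplot is drawn from.

```python
import pandas as pd

# HouseType / Price table from the slide
houses = pd.DataFrame({
    "HouseType": ["H", "H", "U", "U", "T", "T"],
    "Price": [200000, 150000, 75000, 54000, 100000, 90000],
})

# Summary of Price for each HouseType
summary = houses.groupby("HouseType")["Price"].agg(["min", "median", "max"])
print(summary)

# Drawing the grouped boxplot itself requires matplotlib:
# houses.boxplot(column="Price", by="HouseType")
```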

Another example: comparing the distribution of First Innings Score across
different Venues.
EXPLORING NOMINAL AND BINARY DATA

• For nominal (categorical) data, simple proportions or percentages can give us insight.
• Mode: the most commonly occurring category or value in a data set.
• Expected value: when the categories can be associated with a numeric value, this
gives an average value based on each category's probability of occurrence.
• Bar charts: the frequency or proportion of each category plotted as bars.
• Pie charts: the frequency or proportion of each category plotted as wedges in a
pie.
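
The summaries above can be sketched in plain Python on the play column of the sample dataset. The numeric coding yes = 1, no = 0 used for the expected value is an assumption made for this example.

```python
from collections import Counter

# "play" column of the sample dataset
play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

counts = Counter(play)
n = len(play)

# Proportion of each category (what a bar or pie chart would show)
proportions = {category: count / n for category, count in counts.items()}

# Mode: the most frequent category
mode = counts.most_common(1)[0][0]

# Expected value = sum of (numeric value × probability of the category)
numeric_value = {"yes": 1, "no": 0}  # assumed coding for this sketch
expected_value = sum(numeric_value[c] * p for c, p in proportions.items())

print(counts, mode, round(expected_value, 3))
```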

EXAMPLE: BAR CHART
EXAMPLE: PIE CHART

EXPLORING TWO OR MORE VARIABLES

• Contingency Tables: a tally of counts between two or more categorical variables.
• Scatterplots: show the relationship between two numeric variables. Less suitable
when there are very many data points, because the points overlap.
• Hexagonal binning: a plot of two numeric variables with the records binned
into hexagons.
• Boxplots: a simple way to visually compare the distributions of a numeric
variable grouped according to a categorical variable.

CONTINGENCY TABLE
Gender   Designation
Male     Professor
Female   Assoc. Professor
Male     Asst. Professor

Contingency Table: Gender-wise Breakdown of Different Posts
Designation        Male   Female
Professor          3      0
Assoc. Professor   4      1
Asst. Professor    3      0
Sr. Lecturer       4      3
Lecturer           2      2

              Breathing Problem
Smoker?       Yes    No
Non-smoker    1      10
Smoker        4      3
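
A contingency table can be built from raw records with pandas `crosstab`. Since the slides do not include the individual records behind the tables above, this sketch uses the outlook and play columns of the sample dataset instead; the DataFrame name `weather` is chosen for this example.

```python
import pandas as pd

# "outlook" and "play" columns of the sample dataset
weather = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
                "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"],
    "play": ["no", "no", "yes", "yes", "yes", "no", "yes",
             "no", "yes", "yes", "yes", "yes", "yes", "no"],
})

# Rows: outlook categories, columns: play categories, cells: counts
table = pd.crosstab(weather["outlook"], weather["play"])
print(table)
```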

EXAMPLE: SCATTER PLOT

PROBLEM OF SCATTER PLOT
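import numpy as np
import pandas as pd
# 50 random points: individual points are easy to see in a scatter plot
df1 = pd.DataFrame({'x_axis': np.random.rand(50), 'y_axis': np.random.rand(50)})
# 10,000 random points: the points overlap and the plot becomes hard to read
df2 = pd.DataFrame({'x_axis': np.random.rand(10000), 'y_axis': np.random.rand(10000)})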
EXAMPLE: HEXAGONAL BINNING

A hexagonal binning plot is useful for a large dataset. It divides the chart
area into hexagonal bins and assigns each bin a color intensity based on the
number of points that fall in it.
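
A hexagonal binning plot might be produced with pandas and matplotlib roughly as follows. This is a sketch under assumptions: the random data mirrors the 10,000-point example, while the seed, the `gridsize` value, and the output file name `hexbin.png` are chosen only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
df = pd.DataFrame({"x_axis": rng.random(10000), "y_axis": rng.random(10000)})

# gridsize = number of hexagons along the x-axis (assumed value for this sketch)
ax = df.plot.hexbin(x="x_axis", y="y_axis", gridsize=20)
ax.figure.savefig("hexbin.png")

# The per-hexagon counts that drive the color intensity
counts = ax.collections[0].get_array()
print(len(counts), counts.max())
```

A smaller `gridsize` gives larger hexagons and a coarser picture; a larger one shows more detail but noisier counts.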
INTRODUCING PANDAS DATAFRAME

• A Pandas DataFrame is a two-dimensional, size-mutable, potentially
heterogeneous tabular data structure with labeled axes (rows and columns).
• A DataFrame is a two-dimensional data structure, i.e., data is aligned in a
tabular fashion in rows and columns. A Pandas DataFrame consists of three
principal components: the data, the rows, and the columns.

EXAMPLE: PANDAS DATAFRAME

CREATING PANDAS DATAFRAME USING DICTIONARY
import pandas as pd
dict1 = {'id':[1,2,3],'name':['alice','bob','charlie'],'age':[20, 25, 32]}
df1 = pd.DataFrame(dict1)
print(df1)

CREATING PANDAS DATAFRAME USING CSV FILE
import pandas as pd
# the file has no header row, so the column names are assigned explicitly
df = pd.read_csv('../sample_data_1.csv', header=None)
df.columns = ['id', 'state', 'population', 'murder_rate']
print(df)
df.head()   # first 5 rows
df.tail()   # last 5 rows
df.count()  # number of non-null values in each column

LIST OF FUNCTIONS ON DATAFRAME
Function     Description
count()      number of non-null observations
sum()        sum of values
mean()       mean of values
median()     median of values
mode()       mode of values
std()        standard deviation of values
var()        variance of values
quantile()   quantile of values
min()        minimum value
max()        maximum value
abs()        absolute value
cumsum()     cumulative sum
cumprod()    cumulative product
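
Several of these functions can be tried on the temperature column of the sample dataset, as in the sketch below. The same methods are also available on a whole DataFrame, where they are applied column by column.

```python
import pandas as pd

# Temperature column of the sample dataset
temperature = pd.Series([85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71])

print(temperature.count())             # number of non-null observations
print(temperature.sum())               # sum of values
print(temperature.mean())              # mean of values
print(temperature.median())            # median of values
print(temperature.mode().tolist())     # mode() returns a Series; ties give several values
print(temperature.min(), temperature.max())
print(temperature.quantile(0.25))      # uses linear interpolation by default
print(temperature.cumsum().iloc[-1])   # last cumulative sum equals the total
```

Note that `quantile(0.25)` with the default interpolation can differ slightly from a quartile computed by hand with another convention.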

USEFUL RESOURCES

• Chapter 1, Practical Statistics for Data Scientists by Bruce and Bruce

• https://pandas.pydata.org/pandas-docs/stable/reference/index.html
• https://etav.github.io/python/count_basic_freq_plot.html

THANK YOU
