Summer 2022
Data Mining and Machine
Learning (CSE 321)
Topic – 4: Exploring Data
Course Teacher:
Md. Aynul Hasan Nahid
Lecturer
Department of Computer Science and Engineering
Daffodil International University
Recommended Reading
• “Introduction to Data Mining,” Pang-Ning
Tan, Michael Steinbach and Vipin Kumar,
Addison Wesley, 2006.
☞ Chapter 3 (Exploring Data)
What is data exploration?
A preliminary exploration of the data to
better understand its characteristics.
● Key motivations of data exploration include
– Helping to select the right tool for preprocessing or analysis
– Making use of humans’ abilities to recognize patterns
◆ People can recognize patterns not captured by data analysis
tools
● Related to the area of Exploratory Data Analysis (EDA)
– Created by statistician John Tukey
– Seminal book is Exploratory Data Analysis by Tukey
– A nice online introduction can be found in Chapter 1 of the NIST
Engineering Statistics Handbook
http://www.itl.nist.gov/div898/handbook/index.htm
Techniques Used In Data Exploration
● In EDA, as originally defined by Tukey
– The focus was on visualization
– Clustering and anomaly detection were viewed as
exploratory techniques
– In data mining, clustering and anomaly detection are
major areas of interest, and not thought of as just
exploratory
● In our discussion of data exploration, we focus on
– Summary statistics
Iris Sample Data Set
● Many of the exploratory data techniques are illustrated
with the Iris Plant data set.
– Can be obtained from the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
– From the statistician Ronald A. Fisher
– Three flower types (classes):
◆ Setosa
◆ Virginica
◆ Versicolour
– Four (non-class) attributes
◆ Sepal width and length
◆ Petal width and length
Image credit: Virginica. Robert H. Mohlenbrock. USDA
NRCS. 1995. Northeast wetland flora: Field
office guide to plant species. Northeast National
Technical Center, Chester, PA. Courtesy of
USDA NRCS Wetland Science Institute.
Summary Statistics
● Summary statistics are numbers that summarize
properties of the data
– Summarized properties include frequency, location and
spread
◆ Examples: location - mean
spread - standard deviation
– Most summary statistics can be calculated in a single
pass through the data
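The single-pass claim can be illustrated with a minimal Python sketch. It assumes Welford's running-update method, which is one common way to get the mean and standard deviation in one pass and is not named in the slides. The function name `single_pass_summary` and the sample values are hypothetical.

```python
import math

def single_pass_summary(values):
    """Return (count, mean, sample standard deviation) using one pass over values."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count          # update the running mean
        m2 += delta * (x - mean)       # update using old and new mean
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, math.sqrt(variance)

# Hypothetical measurements in cm, for illustration only.
sample = [5.1, 4.9, 4.7, 4.6, 5.0]
count, mean, std = single_pass_summary(sample)
print(count, round(mean, 3), round(std, 3))
```

Because each value is touched once and only three running numbers are kept, the same loop works on data that is streamed or too large to hold in memory.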
Frequency and Mode
● The frequency of an attribute value is the
percentage of time the value occurs in the
data set
– For example, given the attribute ‘gender’ and a
representative population of people, the gender
‘female’ occurs about 50% of the time.
● The mode of an attribute is the most frequent
attribute value
● The notions of frequency and mode are typically
used with categorical data
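As a small sketch of these two definitions, the Python below computes the frequency of each value as a percentage and picks the mode. The helper names `frequencies` and `mode` and the short species list are made up for illustration; they are not taken from the Iris data file.

```python
from collections import Counter

def frequencies(values):
    """Map each distinct value to the percentage of records in which it occurs."""
    counts = Counter(values)
    total = len(values)
    return {value: 100.0 * n / total for value, n in counts.items()}

def mode(values):
    """Return the most frequent value (ties are broken by first occurrence)."""
    return Counter(values).most_common(1)[0][0]

# Hypothetical categorical attribute, for illustration only.
species = ["Setosa", "Virginica", "Setosa", "Versicolour", "Setosa", "Virginica"]
print(frequencies(species))
print(mode(species))
```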
Percentiles
● For continuous data, the notion of a percentile is
more useful.
● Given an ordinal or continuous attribute x and a
number p between 0 and 100, the pth percentile is
a value x_p of x such that p% of the observed
values of x are less than x_p.
● For instance, the 50th percentile is the value x_50%
such that 50% of all values of x are less than x_50%.
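Several conventions exist for computing percentiles on a finite sample. The sketch below assumes the nearest-rank convention, which returns the smallest observed value with at least p% of the data at or below it. This is close to, but not identical to, the "less than" wording above. The function name `percentile_nearest_rank` is hypothetical.

```python
import math

def percentile_nearest_rank(values, p):
    """Return the p-th percentile of values using the nearest-rank convention."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    # Rank is 1-based; p = 0 is mapped to the smallest value.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

data = list(range(1, 11))  # the values 1, 2, ..., 10
for p in (0, 25, 50, 100):
    print(p, percentile_nearest_rank(data, p))
```

Other conventions interpolate between neighbouring values, so library routines may return slightly different numbers for the same data.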
Measures of Location: Mean and Median
● The mean is the most common measure of the
location of a set of points.
● However, the mean is very sensitive to outliers.
● Thus, the median or a trimmed mean is also
commonly used.
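The effect of an outlier on these three measures can be seen in a short Python sketch. The helper `trimmed_mean` and its `trim_fraction` parameter are illustrative names, and the data (nine small values plus one extreme value) is invented for the example.

```python
import statistics

def trimmed_mean(values, trim_fraction):
    """Mean after discarding trim_fraction of the values from each end."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number of values dropped per end
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)

# Invented data: the single value 1000 is an outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
print("mean:", statistics.mean(data))
print("median:", statistics.median(data))
print("10% trimmed mean:", trimmed_mean(data, 0.1))
```

The mean is pulled far toward the outlier, while the median and the trimmed mean stay near the bulk of the data.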
Measures of Spread: Range and Variance
● Range is the difference between the max and min
● The variance or standard deviation is the most
common measure of the spread of a set of points.
● However, both are also sensitive to outliers, so
other, more robust measures are often used.
Visually shown in box plots
(and in SEEQ results)
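The slide does not name the other measures. As an assumption, the sketch below uses two common robust choices: the median absolute deviation (MAD) and the interquartile range (IQR), the latter being what the box of a box plot spans. It relies on `statistics.quantiles` from the Python standard library (Python 3.8+) with its default quartile method. The function name `spread_summary` and the data are illustrative.

```python
import statistics

def spread_summary(values):
    """Return several measures of spread for a list of numbers."""
    ordered = sorted(values)
    med = statistics.median(ordered)
    q1, _, q3 = statistics.quantiles(ordered, n=4)  # quartile cut points
    return {
        "range": ordered[-1] - ordered[0],
        "variance": statistics.variance(ordered),   # sample variance
        "std_dev": statistics.stdev(ordered),       # sample standard deviation
        "mad": statistics.median([abs(x - med) for x in ordered]),
        "iqr": q3 - q1,
    }

# Invented data: the single value 1000 is an outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
for name, value in spread_summary(data).items():
    print(name, round(value, 2))
```

On this data the range and standard deviation are dominated by the outlier, whereas the MAD and IQR still describe the spread of the nine ordinary values.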