Wk. 4. Exploring Data (12-05-2021)

Uploaded by

walid49161

Summer 2022

Data Mining and Machine Learning (CSE 321)

Topic – 4: Exploring Data

Course Teacher:
Md. Aynul Hasan Nahid
Lecturer
Department of Computer Science and Engineering
Daffodil International University

Recommended Reading

• “Introduction to Data Mining,” Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Addison Wesley, 2006.
☞ Chapter 3 (Exploring Data)

What is data exploration?

A preliminary exploration of the data to better understand its characteristics.
● Key motivations of data exploration include
– Helping to select the right tool for preprocessing or analysis
– Making use of humans’ abilities to recognize patterns
◆ People can recognize patterns not captured by data analysis tools

● Related to the area of Exploratory Data Analysis (EDA)
– Created by statistician John Tukey
– Seminal book is Exploratory Data Analysis by Tukey
– A nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook
http://www.itl.nist.gov/div898/handbook/index.htm
Techniques Used In Data Exploration

● In EDA, as originally defined by Tukey
– The focus was on visualization
– Clustering and anomaly detection were viewed as exploratory techniques
– In data mining, clustering and anomaly detection are major areas of interest, and not thought of as just exploratory

● In our discussion of data exploration, we focus on
– Summary statistics
Iris Sample Data Set

● Many of the exploratory data techniques are illustrated with the Iris Plant data set.
– Can be obtained from the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
– From the statistician R. A. Fisher
– Three flower types (classes):
◆ Setosa
◆ Virginica
◆ Versicolour

– Four (non-class) attributes
◆ Sepal width and length
◆ Petal width and length

Image: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.
Summary Statistics

● Summary statistics are numbers that summarize properties of the data
– Summarized properties include frequency, location and spread
◆ Examples: location - mean; spread - standard deviation
– Most summary statistics can be calculated in a single pass through the data
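
The single-pass idea can be sketched in Python. This is only an illustration, not a method given in the slides: it uses Welford's update rule (a swapped-in, well-known technique) to get the mean and sample standard deviation in one loop. The function name running_summary and the sample values are hypothetical.

```python
import math


def running_summary(values):
    """Return (count, mean, sample standard deviation) in one pass.

    Uses Welford's update rule, so the data is read only once and
    does not need to be stored.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    if count < 2:
        # Standard deviation is not meaningful for fewer than two values
        return count, mean, 0.0
    return count, mean, math.sqrt(m2 / (count - 1))


# Hypothetical values, for illustration only
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n, mean, std = running_summary(values)
print(n, mean, round(std, 3))
```

Because the loop never revisits earlier values, the same function also works on a generator or a file read line by line.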
Frequency and Mode

● The frequency of an attribute value is the percentage of time the value occurs in the data set
– For example, given the attribute ‘gender’ and a
representative population of people, the gender
‘female’ occurs about 50% of the time.
● The mode of an attribute is the most frequent
attribute value
● The notions of frequency and mode are typically
used with categorical data
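
A minimal sketch of frequency and mode for a categorical attribute, using Python's collections.Counter. The helper names frequencies and mode and the list of species labels are hypothetical; the labels are not taken from the Iris data set.

```python
from collections import Counter


def frequencies(values):
    """Map each distinct value to the percentage of times it occurs."""
    counts = Counter(values)
    total = len(values)
    return {value: 100.0 * c / total for value, c in counts.items()}


def mode(values):
    """Return the most frequent value (one of them if there is a tie)."""
    return Counter(values).most_common(1)[0][0]


# Hypothetical class labels, for illustration only
species = ["setosa", "virginica", "setosa", "versicolour", "setosa", "virginica"]
print(frequencies(species))
print(mode(species))
```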
Percentiles

● For continuous data, the notion of a percentile is more useful.

Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile is a value x_p of x such that p% of the observed values of x are less than x_p.

● For instance, the 50th percentile is the value x_50 such that 50% of all values of x are less than x_50.
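
Several conventions exist for computing a percentile from a finite sample. The sketch below assumes the nearest-rank convention, which is one common choice and is not specified in the slides: it returns an observed value such that at least p% of the values are less than or equal to it. The function name and the sample data are hypothetical.

```python
import math


def percentile_nearest_rank(values, p):
    """Return the pth percentile of values using the nearest-rank rule."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    # Rank is 1-based; p = 0 is mapped to the smallest value
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical values, for illustration only
data = [15, 20, 35, 40, 50]
print(percentile_nearest_rank(data, 50))
```

Other conventions interpolate between neighbouring values, so results from statistics libraries may differ slightly on small samples.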
Measures of Location: Mean and Median

● The mean is the most common measure of the location of a set of points.
● However, the mean is very sensitive to outliers.
● Thus, the median or a trimmed mean is also
commonly used.
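
The effect of an outlier on the three location measures can be illustrated with a short sketch. The trimmed_mean helper, its trim_fraction parameter, and the data are hypothetical; the trimming rule used here (drop the same fraction of values from each end of the sorted data) is one common definition and is assumed rather than taken from the slides.

```python
import statistics


def trimmed_mean(values, trim_fraction):
    """Mean after dropping trim_fraction of the values from each end."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # values dropped per end
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)


# Hypothetical values with one extreme outlier (1000)
data = [30, 32, 35, 36, 38, 40, 41, 43, 45, 1000]
print(statistics.mean(data))      # pulled far upward by the outlier
print(statistics.median(data))    # barely affected
print(trimmed_mean(data, 0.1))    # outlier is dropped before averaging
```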
Measures of Spread: Range and Variance

● Range is the difference between the maximum and minimum values
● The variance or standard deviation is the most common measure of the spread of a set of points.
● However, the variance is also sensitive to outliers, so other measures are often used.

Visually shown in box plots (and in SEEQ results)
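
A rough sketch of these spread measures using Python's statistics module. The slide does not name the "other measures"; the interquartile range (the width of the box in a box plot) is used here as one common outlier-resistant choice, and that is an assumption. The function name spread_summary and the data are hypothetical.

```python
import statistics


def spread_summary(values):
    """Return range, sample variance, sample std. deviation and IQR."""
    # quantiles(..., n=4) returns the three quartile cut points
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "range": max(values) - min(values),
        "variance": statistics.variance(values),
        "std_dev": statistics.stdev(values),
        "iqr": q3 - q1,  # assumed robust measure, not named in the slides
    }


# Hypothetical values with one extreme outlier (1000)
data = [30, 32, 35, 36, 38, 40, 41, 43, 45, 1000]
for name, value in spread_summary(data).items():
    print(name, round(value, 2))
```

On data like this, the range and standard deviation are dominated by the single outlier, while the interquartile range stays close to the spread of the bulk of the values.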
