Data Mining
Why Data
Preprocessing:
Introduction
Data Preprocessing - Introduction
What is Data
Pre-processing
Process raw data to
prepare it for another
processing procedure
Transforming raw
data
into an
understandable
format.
Data Preprocessing - Introduction
Why Data Preprocessing
• No quality data = no
quality data mining results
• Quality decisions
require quality data
• Data is dirty in real
world
• Noisy
• Incomplete
• Inconsistent
Data Preprocessing - Introduction
Noisy & Inconsistent
Data
Noisy data
Random variance
and/or error in
measurement
Containing errors or
outliers
Data Preprocessing - Introduction
Incomplete data
Lacking attribute
values
Lacking certain
attributes of interest
Containing only
aggregate data
Data Preprocessing - Introduction
Inconsistent data
Containing
discrepancies in codes
or names
Age = “42” but
birthday = “03/07/1997”
Rating coded as “1, 2, 3”
in one place and
“A, B, C” in another
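A discrepancy such as the age and birthday pair above can be caught with a simple rule-based check. The sketch below is illustrative only: the function names, the MM/DD/YYYY reading of the date, and the reference date are assumptions, not something the slides specify.

```python
from datetime import date, datetime

def age_on(birthday: date, as_of: date) -> int:
    """Age in whole years on the date `as_of`."""
    had_birthday = (as_of.month, as_of.day) >= (birthday.month, birthday.day)
    return as_of.year - birthday.year - (0 if had_birthday else 1)

def is_consistent(age: str, birthday: str, as_of: date) -> bool:
    """True when the recorded age matches the age implied by the birthday."""
    # Assumed date format: MM/DD/YYYY (the slides do not say which it is).
    born = datetime.strptime(birthday, "%m/%d/%Y").date()
    return int(age) == age_on(born, as_of)

# Hypothetical date on which the record was collected.
collected = date(2020, 1, 1)
print(is_consistent("42", "03/07/1997", collected))  # False: the values disagree
print(is_consistent("22", "03/07/1997", collected))  # True
```

Records that fail such a check are candidates for correction, not automatic deletion, since either field could be the wrong one.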
Data Mining
Why data
Preprocessing:
Why is data dirty
Why is data dirty
Reasons
• Noise
• Incompleteness
• Inaccuracy
• Inconsistency
• Timeliness
Why is data dirty
Reasons for Noise
• Faulty data
collection
instruments
• Human or computer
error at data entry
• Errors in data
transmission
Why is data dirty
Incompleteness
“Not applicable” data
value when collected
Different considerations
at data collection time
and at analysis time
Human, hardware, or
software problems
Why is data dirty
Reasons for Inaccuracy
• Errors in data
transmission
• Inconsistent
naming
conventions
• Duplicate tuples
• Inaccurate data
collection
Why is data dirty
Inconsistency &
Timeliness
Different data
sources
Functional
dependency violation
Data not collected
at the required
frequency
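A functional dependency violation, as listed above, means that one attribute value maps to more than one value of an attribute it is supposed to determine. The following is a minimal sketch of how such violations might be detected. The helper name `fd_violations` and the zip/city sample rows are hypothetical.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return {lhs_value: set of rhs values} for each lhs value that
    maps to more than one rhs value, i.e. violates lhs -> rhs."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: vals for key, vals in seen.items() if len(vals) > 1}

# Hypothetical rows merged from two sources with different naming conventions.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "60601", "city": "Chicago"},
]
violations = fd_violations(rows, "zip", "city")
print(violations)  # only zip 10001 is reported
```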
Data Mining
Why data
Preprocessing:
Multi-Dimensional
Measure of Data
Quality
Measuring Data Quality
Measure of Data Quality
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value added
• Interpretability
• Accessibility
Measuring Data Quality
Accuracy &
Completeness
Accuracy: the stored
data is correct and
unambiguous.
Completeness: all data
needed for the required
information is
available.
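One simple way to put a number on completeness is the fraction of required attribute values that are actually present. This is only a sketch under assumptions: the function name, the choice to treat `None` and the empty string as missing, and the sample rows are all illustrative.

```python
def completeness(rows, required):
    """Fraction of required attribute values that are present.
    A value counts as missing if the key is absent, None, or ""."""
    total = len(rows) * len(required)
    if total == 0:
        return 1.0  # nothing was required, so nothing is missing
    present = sum(
        1
        for row in rows
        for attr in required
        if row.get(attr) not in (None, "")
    )
    return present / total

# Hypothetical customer records; the second one lacks an income value.
rows = [
    {"name": "Ann", "income": 30000},
    {"name": "Bob", "income": None},
]
print(completeness(rows, ["name", "income"]))  # 0.75
```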
Measuring Data Quality
Consistency &
Timeliness
Consistency: data is in
the same format at all
times and across
different sources.
Timeliness: data is
available within the
required time.
Measuring Data Quality
Believability & Value
added
Believability: how much
the data can be trusted
to be true
Value added: what
impact new data has
on existing data
Measuring Data Quality
Interpretability &
Accessibility
How easily data can
be understood.
How and how easily
data can be
accessed
Data Mining
Data Cleaning
Introduction
Data Cleaning
Introduction
Fill in missing values
Smooth out noise
Identify outliers
Correct
inconsistencies
Data Cleaning
Advantage
Avoids false,
inaccurate, or
misleading
conclusions
Makes data more
reliable and
accurate
Data Cleaning
Need
Transmission error
Faulty equipment
Error due to different
conventions or scales
Availability of data
Data Mining
Data Cleaning
Missing Data
Missing Data
Missing data
Missing data is the
unavailability of
essential data
that is required to
draw a conclusion
or derive information.
Missing Data
Reasons for Missing Data
Equipment
malfunction
Inconsistent with other
recorded data and thus
deleted
Data not entered
History or changes of
the data not registered
Missing Data
Handling missing values
Ignore the tuple
Fill in the missing
value manually
Fill in automatically with:
• A global constant
• The attribute mean
• The most probable value
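Three of the strategies above (ignore the tuple, fill with a global constant, fill with the attribute mean) can be sketched in a few lines. The function names, the use of `None` as the missing-value marker, and the sample rows are assumptions made for illustration. Filling with the most probable value needs a predictive model and is not shown here.

```python
from statistics import mean

def drop_incomplete(rows, attr):
    """Ignore the tuple: keep only rows where `attr` is present."""
    return [row for row in rows if row[attr] is not None]

def fill_constant(rows, attr, constant):
    """Fill missing values of `attr` with a global constant."""
    return [
        {**row, attr: constant} if row[attr] is None else row
        for row in rows
    ]

def fill_mean(rows, attr):
    """Fill missing values of `attr` with the mean of the known values."""
    known = [row[attr] for row in rows if row[attr] is not None]
    return fill_constant(rows, attr, mean(known))

# Hypothetical records; None marks a missing income.
rows = [
    {"id": 1, "income": 30000},
    {"id": 2, "income": None},
    {"id": 3, "income": 50000},
]
print(drop_incomplete(rows, "income"))     # rows 1 and 3 only
print(fill_constant(rows, "income", -1))   # row 2 gets -1
print(fill_mean(rows, "income"))           # row 2 gets 40000
```

Each helper returns a new list and leaves the input untouched, which makes it easy to compare strategies on the same data.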
Data Mining
Data Cleaning
Noisy Data
Introduction
Noisy Data Intro
Noisy data
Random error or
variance in a
measured variable.
Noisy data can be
described as
meaningless or
corrupt data that
cannot be understood
by a machine.
Noisy Data Intro
Reasons for Noisy Data
faulty instruments
data entry problems
transmission problems
technology limitation
Inconsistency in
naming convention
Noisy Data Intro
Handling Techniques
Binning
Regression analysis
Outlier analysis in
clustering
Combined computer
and human
inspection
Data Mining
Data Cleaning
Binning
Binning
Binning
Smooth sorted
data by consulting
its neighborhood
The sorted values
are distributed
into a number of
buckets or bins.
Binning
Binning Methods
Smoothing by bin medians
Smoothing by bin boundaries
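The two binning methods named above can be sketched as follows. This is a minimal example that assumes equal-frequency bins and resolves ties toward the lower boundary. The function names and the sample values are illustrative and not taken from the slides.

```python
from statistics import median

def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into bins of `bin_size` items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max.
    Ties go to the lower boundary (an arbitrary choice)."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are already sorted
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# Hypothetical sample values.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_medians(bins))     # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```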
Data Mining
Data Cleaning
Models
Data Cleaning - Models
Models
Linear Regression
Clustering
Data Cleaning - Models
Linear Regression
Find the best line to fit two attributes
Use one attribute to predict the other
Fit the data to functions
Approximate function captures
important patterns/values
Use the function to find
data set values
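As a rough illustration of the points above, the sketch below fits a least-squares line to two attributes and then uses one attribute to predict (smooth) the other. The function name `fit_line` and the sample data are assumptions. A real project would more likely use a statistics library.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical noisy measurements of attribute y against attribute x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)

# Replace each noisy y with the value predicted from x.
smoothed = [slope * x + intercept for x in xs]
print(slope, intercept)
print(smoothed)
```

The smoothed values lie exactly on the fitted line, so random variation around the trend is removed while the overall pattern is kept.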
Data Cleaning - Models
Clustering
Similar values into
groups or clusters
Detect and remove
outliers.
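The idea of grouping similar values and treating values outside any group as outliers can be sketched with a deliberately simple one-dimensional grouping rule. This is a stand-in chosen for brevity, not a specific clustering algorithm from the slides. The function names and the `max_gap` and `min_size` parameters are hypothetical.

```python
def cluster_1d(values, max_gap):
    """Group sorted values into clusters; a new cluster starts whenever
    the gap to the previous value exceeds `max_gap`."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

def split_outliers(values, max_gap, min_size):
    """Values in clusters smaller than `min_size` are treated as outliers."""
    clusters = cluster_1d(values, max_gap)
    kept = [v for c in clusters if len(c) >= min_size for v in c]
    outliers = [v for c in clusters if len(c) < min_size for v in c]
    return kept, outliers

# Hypothetical measurements with one value far from every group.
values = [21, 22, 23, 24, 50, 51, 52, 53, 120]
kept, outliers = split_outliers(values, max_gap=3, min_size=2)
print(kept)      # [21, 22, 23, 24, 50, 51, 52, 53]
print(outliers)  # [120]
```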
Procedure