October 15,
2014
1
Data Mining:
Concepts and
Techniques
October 15,
2014
2
Chapter 3: Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy
generation
Summary
October 15,
2014
3
Why Data Preprocessing?
Data in the real orld is dirty
incomplete: lac!ing attri"ute #alues$ lac!ing
certain attri"utes of interest$ or containing only
aggregate data
noisy: containing errors or outliers
inconsistent: containing discrepancies in codes
or names
%o quality data$ no quality mining results&
'uality decisions must "e "ased on quality data
Data arehouse needs consistent integration of
quality data
October 15,
2014
4
Multi(Dimensional Measure of Data
'uality
) ell(accepted multidimensional #ie:
)ccuracy
Completeness
Consistency
Timeliness
*elie#a"ility
+alue added
,nterpreta"ility
)ccessi"ility
*road categories:
intrinsic$ conte-tual$ representational$ and
accessi"ility.
October 15,
2014
5
Ma/or Tas!s in Data
Preprocessing
Data cleaning
0ill in missing #alues$ smooth noisy data$ identify or
remo#e outliers$ and resol#e inconsistencies
Data integration
,ntegration of multiple data"ases$ data cu"es$ or 1les
Data transformation
%ormalization and aggregation
Data reduction
2"tains reduced representation in #olume "ut produces the
same or similar analytical results
Data discretization
Part of data reduction "ut ith particular importance$
especially for numerical data
October 15,
2014
6
0orms of data
preprocessing
October 15,
2014
7
Data Cleaning
Data cleaning tas!s
0ill in missing #alues
,dentify outliers and smooth out noisy
data
Correct inconsistent data
October 15,
2014
8
Missing Data
Data is not alays a#aila"le
3.g.$ many tuples ha#e no recorded #alue for se#eral
attri"utes$ such as customer income in sales data
Missing data may "e due to
equipment malfunction
inconsistent ith other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not "e considered important at the
time of entry
not register history or changes of the data
Missing data may need to "e inferred.
October 15,
2014
9
4o to 4andle Missing
Data?
,gnore the tuple: usually done hen class la"el is missing
5assuming the tas!s in classi1cation6not e7ecti#e hen the
percentage of missing #alues per attri"ute #aries considera"ly.
0ill in the missing #alue manually: tedious 8 infeasi"le?
9se a glo"al constant to 1ll in the missing #alue: e.g.$
:un!non;$ a ne class?&
9se the attri"ute mean to 1ll in the missing #alue
9se the attri"ute mean for all samples "elonging to the same
class to 1ll in the missing #alue: smarter
9se the most pro"a"le #alue to 1ll in the missing #alue:
inference("ased such as *ayesian formula or decision tree
October 15,
2014
10
%oisy Data
%oise: random error or #ariance in a measured #aria"le
,ncorrect attri"ute #alues may due to
faulty data collection instruments
data entry pro"lems
data transmission pro"lems
technology limitation
inconsistency in naming con#ention
2ther data pro"lems hich requires data cleaning
duplicate records
incomplete data
inconsistent data
October 15,
2014
11
4o to 4andle %oisy Data?
*inning method:
1rst sort data and partition into 5equi(depth< "ins
then one can smooth "y "in means$ smooth "y
"in median$ smooth "y "in "oundaries$ etc.
Clustering
detect and remo#e outliers
Com"ined computer and human inspection
detect suspicious #alues and chec! "y human
=egression
smooth "y 1tting the data into regression
functions
October 15,
2014
12
Simple Discretization Methods:
*inning
3qual(idth 5distance< partitioning:
,t di#ides the range into N inter#als of equal size:
uniform grid
if A and B are the loest and highest #alues of the
attri"ute$ the idth of inter#als ill "e: W > 5B(A<?N.
The most straightforard
*ut outliers may dominate presentation
S!eed data is not handled ell.
3qual(depth 5frequency< partitioning:
,t di#ides the range into N inter#als$ each containing
appro-imately same num"er of samples
@ood data scaling
Managing categorical attri"utes can "e tric!y.
October 15,
2014
13
*inning Methods for Data
Smoothing
A Sorted data for price 5in dollars<: B$ C$ D$ EF$ GE$ GE$ GB$ GF$ GH$
GC$ GD$ 3B
A Partition into 5equi(depth< "ins:
( *in E: B$ C$ D$ EF
( *in G: GE$ GE$ GB$ GF
( *in 3: GH$ GC$ GD$ 3B
A Smoothing "y "in means:
( *in E: D$ D$ D$ D
( *in G: G3$ G3$ G3$ G3
( *in 3: GD$ GD$ GD$ GD
A Smoothing "y "in "oundaries:
( *in E: B$ B$ B$ EF
( *in G: GE$ GE$ GF$ GF
( *in 3: GH$ GH$ GH$ 3B
October 15,
2014
14
Cluster )nalysis
October 15,
2014
15
=egression
x
y
y = x + 1
X1
Y1
Y1
October 15,
2014
16
Data ,ntegration
Data integration:
com"ines data from multiple sources into a
coherent store
Schema integration
integrate metadata from di7erent sources
3ntity identi1cation pro"lem: identify real orld
entities from multiple data sources$ e.g.$ ).cust(id
*.cust(I
Detecting and resol#ing data #alue conJicts
for the same real orld entity$ attri"ute #alues
from di7erent sources are di7erent
possi"le reasons: di7erent representations$
di7erent scales$ e.g.$ metric #s. *ritish units
October 15,
2014
17
4andling =edundant
Data in Data
,ntegration
=edundant data occur often hen integration of
multiple data"ases
The same attri"ute may ha#e di7erent names in
di7erent data"ases
2ne attri"ute may "e a :deri#ed; attri"ute in
another ta"le$ e.g.$ annual re#enue
=edundant data may "e a"le to "e detected "y
correlational analysis
Careful integration of the data from multiple sources
may help reduce?a#oid redundancies and
inconsistencies and impro#e mining speed and quality
October 15,
2014
18
Data Transformation
Smoothing: remo#e noise from data
)ggregation: summarization$ data cu"e construction
@eneralization: concept hierarchy clim"ing
%ormalization: scaled to fall ithin a small$ speci1ed
range
min(ma- normalization
z(score normalization
normalization "y decimal scaling
)ttri"ute?feature construction
%e attri"utes constructed from the gi#en ones
October 15,
2014
19
Data Transformation:
%ormalization
min(ma- normalization
z(score normalization
normalization "y decimal scaling
A A A
A A
A
min new min new max new
min max
min v
v _ ) _ _ ( ' +
=
A
A
dev stand
mean v
v
_
'
=
j
v
v
10
' = Were j !" te "#$%%e"t !&te'er "(c t$t )$x(* *)+1
' v
October 15,
2014
20
Data =eduction
Strategies
Warehouse may store tera"ytes of data: Comple-
data analysis?mining may ta!e a #ery long time to
run on the complete data set
Data reduction
2"tains a reduced representation of the data set
that is much smaller in #olume "ut yet produces
the same 5or almost the same< analytical results
Data reduction strategies
Data cu"e aggregation
Dimensionality reduction
%umerosity reduction
Discretization and concept hierarchy generation
October 15,
2014
21
Data Cu"e )ggregation
The loest le#el of a data cu"e
the aggregated data for an indi#idual entity of
interest
e.g.$ a customer in a phone calling data arehouse.
Multiple le#els of aggregation in data cu"es
0urther reduce the size of data to deal ith
=eference appropriate le#els
9se the smallest representation hich is enough to
sol#e the tas!
'ueries regarding aggregated information should "e
ansered using data cu"e$ hen possi"le
October 15,
2014
22
Dimensionality =eduction
0eature selection 5i.e.$ attri"ute su"set selection<:
Select a minimum set of features such that the
pro"a"ility distri"ution of di7erent classes gi#en the
#alues for those features is as close as possi"le to the
original distri"ution gi#en the #alues of all features
reduce I of patterns in the patterns$ easier to
understand
4euristic methods 5due to e-ponential I of choices<:
step(ise forard selection
step(ise "ac!ard elimination
com"ining forard selection and "ac!ard elimination
decision(tree induction
October 15,
2014
23
Example of Decision Tree Induction
,&!t!$% $ttr!b(te "et-
./1, /2, /3, /4, /5, /60
/4 1
/11
/61
2%$"" 1
2%$"" 2
2%$"" 1
2%$"" 2
3 4e5(ce5 $ttr!b(te "et- ./1, /4, /60
October 15,
2014
25
Data Compression
String compression
There are e-tensi#e theories and ell(tuned
algorithms
Typically lossless
*ut only limited manipulation is possi"le ithout
e-pansion
)udio?#ideo compression
Typically lossy compression$ ith progressi#e
re1nement
Sometimes small fragments of signal can "e
reconstructed ithout reconstructing the hole
Time sequence is not audio
Typically short and #ary sloly ith time
October 15,
2014
26
Data Compression
Or!'!&$% 6$t$ 2o#7re""e5
6$t$
%o""%e""
Or!'!&$% 6$t$
/77rox!#$te5
%
o
"
"
y
October 15,
2014
27
Wa#elet Transforms
Discrete a#elet transform 5DWT<: linear signal
processing
Compressed appro-imation: store only a small fraction
of the strongest of the a#elet coeKcients
Similar to discrete 0ourier transform 5D0T<$ "ut "etter
lossy compression$ localized in space
Method:
Length$ L$ must "e an integer poer of G 5padding ith Ms$ hen
necessary<
3ach transform has G functions: smoothing$ di7erence
)pplies to pairs of data$ resulting in to set of data of length L?G
)pplies to functions recursi#ely$ until reaches the desired
length
4aarG Dau"echieB
October 15,
2014
28
@i#en N data #ectors from k(dimensions$ 1nd c
<= k orthogonal #ectors that can "e "est used
to represent data
The original data set is reduced to one
consisting of % data #ectors on c principal
components 5reduced dimensions<
3ach data #ector is a linear com"ination of the c
principal component #ectors
Wor!s for numeric data only
9sed hen the num"er of dimensions is large
Principal Component )nalysis
October 15,
2014
29
X1
X2
Y1
Y2
Principal Component Analysis
October 15,
2014
30
%umerosity =eduction
Parametric methods
)ssume the data 1ts some model$ estimate
model parameters$ store only the parameters$
and discard the data 5e-cept possi"le outliers<
Log(linear models: o"tain #alue at a point in m(
D space as the product on appropriate marginal
su"spaces
%on(parametric methods
Do not assume models
Ma/or families: histograms$ clustering$ sampling
October 15,
2014
31
=egression and Log(Linear
Models
Linear regression: Data are modeled to 1t a straight
line
2ften uses the least(square method to 1t the line
Multiple regression: allos a response #aria"le N to
"e modeled as a linear function of multidimensional
feature #ector
Log(linear model: appro-imates discrete
multidimensional pro"a"ility distri"utions
October 15,
2014
Linear regression: Y = + X
To parameters $ and specify the line and
are to "e estimated "y using the data at hand.
using the least squares criterion to the !non
#alues of Y1, Y2, , X1, X2, .
Multiple regression: Y = b0 + b1 X1 + b2 X2.
Many nonlinear functions can "e transformed
into the a"o#e.
Log(linear models:
The multi(ay ta"le of /oint pro"a"ilities is
appro-imated "y a product of loer(order
ta"les.
Pro"a"ility: p(a, b, c, d) = ab acad bcd
=egress )nalysis and Log(
Linear Models
October 15,
2014
33
4istograms
) popular data
reduction technique
Di#ide data into "uc!ets
and store a#erage
5sum< for each "uc!et
Can "e constructed
optimally in one
dimension using
dynamic programming
=elated to quantization
pro"lems.
0
5
10
15
20
25
30
35
40
1
0
0
0
0
2
0
0
0
0
3
0
0
0
0
4
0
0
0
0
5
0
0
0
0
6
0
0
0
0
7
0
0
0
0
8
0
0
0
0
9
0
0
0
0
1
0
0
0
0
0
October 15,
2014
34
Clustering
Partition data set into clusters$ and one can store
cluster representation only
Can "e #ery e7ecti#e if data is clustered "ut not if
data is :smeared;
Can ha#e hierarchical clustering and "e stored in
multi(dimensional inde- tree structures
There are many choices of clustering de1nitions
and clustering algorithms$ further detailed in
Chapter C
October 15,
2014
35
Sampling
)llo a mining algorithm to run in comple-ity that is
potentially su"(linear to the size of the data
Choose a representati#e su"set of the data
Simple random sampling may ha#e #ery poor
performance in the presence of s!e
De#elop adapti#e sampling methods
Strati1ed sampling:
)ppro-imate the percentage of each class 5or
su"population of interest< in the o#erall data"ase
9sed in con/unction ith s!eed data
Sampling may not reduce data"ase ,?2s 5page at a
time<.
October 15,
2014
36
Sampling
8
4
8
W
O
4
(
"
!
#
7
%
e
r
$
&
5
o
#
"
$
#
7
%
e
9
!
t
o
(
t
r
e
7
%
$
c
e
#
e
&
t
)
8
4
8
W
4
4$9 6$t$
October 15,
2014
37
Sampling
4$9 6$t$
2%("ter:8tr$t!;!e5 8$#7%e
October 15,
2014
38
4ierarchical =eduction
9se multi(resolution structure ith di7erent degrees
of reduction
4ierarchical clustering is often performed "ut tends
to de1ne partitions of data sets rather than :clusters;
Parametric methods are usually not amena"le to
hierarchical representation
4ierarchical aggregation
)n inde- tree hierarchically di#ides a data set into
partitions "y #alue range of some attri"utes
3ach partition can "e considered as a "uc!et
Thus an inde- tree ith aggregates stored at each
node is a hierarchical histogram
October 15,
2014
39
Discretization
Three types of attri"utes:
%ominal 6 #alues from an unordered set
2rdinal 6 #alues from an ordered set
Continuous 6 real num"ers
Discretization:
di#ide the range of a continuous attri"ute into
inter#als
Some classi1cation algorithms only accept
categorical attri"utes.
=educe data size "y discretization
Prepare for further analysis
October 15,
2014
40
Discretization and Concept
hierachy
Discretization
reduce the num"er of #alues for a gi#en
continuous attri"ute "y di#iding the range of
the attri"ute into inter#als. ,nter#al la"els can
then "e used to replace actual data #alues.
Concept hierarchies
reduce the data "y collecting and replacing
lo le#el concepts 5such as numeric #alues for
the attri"ute age< "y higher le#el concepts
5such as young$ middle(aged$ or senior<.
October 15,
2014
41
Discretization and concept
hierarchy generation for numeric
data
*inning 5see sections "efore<
4istogram analysis 5see sections "efore<
Clustering analysis 5see sections "efore<
3ntropy("ased discretization
Segmentation "y natural partitioning
October 15,
2014
42
3ntropy(*ased Discretization
@i#en a set of samples S$ if S is partitioned into to
inter#als SE and SG using "oundary T$ the entropy
after partitioning is
The "oundary that minimizes the entropy function
o#er all possi"le "oundaries is selected as a "inary
discretization.
The process is recursi#ely applied to partitions
o"tained until some stopping criterion is met$ e.g.$
3-periments sho that it may reduce data size and
impro#e classi1cation accuracy
E S T
S
Ent
S
Ent
S
S
S
S
( , )
* *
* *
( )
* *
* *
( ) = +
1
1
2
2
Ent S E T S ( ) ( , ) >
October 15,
2014
43
Segmentation "y natural
partitioning
3(B(F rule can "e used to segment numeric data into
relati#ely uniform$ :natural; inter#als.
A ,f an inter#al co#ers 3$ H$ O or D distinct #alues at
the most signi1cant digit$ partition the range into
3 equi(idth inter#als
A ,f it co#ers G$ B$ or C distinct #alues at the most
signi1cant digit$ partition the range into B inter#als
A ,f it co#ers E$ F$ or EM distinct #alues at the most
signi1cant digit$ partition the range into F inter#als
October 15,
2014
44
3-ample of 3(B(F rule
(<=4000 <=5,000)
(<=400 < 0)
(<=400 <
<=300)
(<=300 <
<=200)
(<=200 <
<=100)
(<=100 <
0)
(0 < =1,000)
(0 <
=200)
(=200 <
=400)
(=400 <
=600)
(=600 <
=800) (=800 <
=1,000)
(=2,000 < =5, 000)
(=2,000 <
=3,000)
(=3,000 <
=4,000)
(=4,000 <
=5,000)
(=1,000 < =2, 000)
(=1,000 <
=1,200)
(=1,200 <
=1,400)
(=1,400 <
=1,600)
(=1,600 <
=1,800)
(=1,800 <
=2,000)
#"5=1,000 >o9=<=1,000 ?!'==2,000 8te7 2-
8te7 4-
8te7 1- <=351 <=159 7ro;!t =1,838 =4,700
)!& >o9 (!@e, 5A<t!%e) ?!'(!@e, 95A<0 t!%e) )$x
co(&t
(<=1,000 < =2,000)
(<=1,000 < 0)
(0 <= 1,000)
8te7 3-
(=1,000 < =2,000)
October 15,
2014
45
Concept hierarchy generation for
categorical data
Speci1cation of a partial ordering of attri"utes
e-plicitly at the schema le#el "y users or e-perts
Speci1cation of a portion of a hierarchy "y
e-plicit data grouping
Speci1cation of a set of attri"utes$ "ut not of
their partial ordering
Speci1cation of only a partial set of attri"utes
October 15,
2014
46
Speci1cation of a set of
attri"utes
Concept hierarchy can "e automatically
generated "ased on the num"er of distinct
#alues per attri"ute in the gi#en attri"ute set.
The attri"ute ith the most distinct #alues is
placed at the loest le#el of the hierarchy.
co(&try
7roB!&ce_or_ "t$te
c!ty
"treet
15 5!"t!&ct B$%(e"
65 5!"t!&ct
B$%(e"
3567 5!"t!&ct B$%(e"
674,339 5!"t!&ct B$%(e"
October 15,
2014
47
Summary
Data preparation is a "ig issue for "oth
arehousing and mining
Data preparation includes
Data cleaning and data integration
Data reduction and feature selection
Discretization
) lot a methods ha#e "een de#eloped "ut still an
acti#e area of research