Major Tasks in Data Preprocessing
• Data preprocessing techniques can improve data quality, thereby
helping to improve the accuracy and efficiency of the subsequent
mining process.
• The major steps involved in data preprocessing are:
– data cleaning,
– data integration,
– data reduction, and
– data transformation.
Arya J S 1
I. Data Cleaning
• Data cleaning (or data cleansing) routines attempt to fill in missing
values, smooth out noise while identifying outliers, and correct
inconsistencies in the data.
1. Ways of handling missing values:
A. Ignore the tuple.
B. Fill in the missing value manually.
C. Use a global constant to fill in the missing value, such as “Unknown” or ∞.
D. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value.
E. Use the attribute mean or median for all samples belonging to the same class as the given tuple.
F. Use the most probable value to fill in the missing value.
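Strategy D can be sketched in a few lines of Python. This is an illustrative sketch only: the helper name `fill_missing_with_mean` and the use of `None` to mark missing entries are assumptions, not part of any particular library.

```python
from statistics import mean

def fill_missing_with_mean(values):
    """Strategy D: fill missing entries (marked here as None) with the
    mean of the observed values for the attribute."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Assumed sample attribute values with two missing entries.
ages = [25, None, 35, 40, None]
print(fill_missing_with_mean(ages))
```

The same pattern works for the median (swap in `statistics.median`), and strategy E simply applies it separately within each class.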
2. Noisy Data
• Noise is a random error or variance in a measured variable.
• The following data smoothing techniques can be used to remove
noise:
1. Binning: Binning methods smooth a sorted data value by
consulting its “neighborhood,” that is, the values around it.
The sorted values are distributed into a number of “buckets,” or
bins.
[Figure: binning example]
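Equal-frequency binning with smoothing by bin means can be sketched as follows. The price values and the helper name are illustrative assumptions; smoothing by bin boundaries would instead replace each value by the closest of the bin's minimum and maximum.

```python
def smooth_by_bin_means(values, bin_size):
    """Equal-frequency binning: sort the values, partition them into
    consecutive bins of bin_size, and replace each value by its bin mean."""
    data = sorted(values)
    smoothed = []
    for i in range(0, len(data), bin_size):
        bin_vals = data[i:i + bin_size]
        bin_mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([bin_mean] * len(bin_vals))
    return smoothed

# Assumed sample price data: 9 sorted values, partitioned into 3 bins of 3.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# → [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```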
2. Regression: Data smoothing can also be done by regression, a
technique that conforms data values to a function.
– Linear regression involves finding the “best” line to fit two attributes (or
variables) so that one attribute can be used to predict the other.
– Multiple linear regression is an extension of linear regression, where
more than two attributes are involved.
3. Outlier analysis: Outliers may be detected by clustering, for
example, where similar values are organized into groups, or
“clusters.” Intuitively, values that fall outside of the set of clusters
may be considered outliers.
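Smoothing by linear regression (technique 2 above) can be sketched in pure Python with a least-squares fit; the function name `fit_line` and the sample measurements are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x (simple linear regression)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# Assumed noisy measurements; smoothing replaces each y by its value on the line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]
a, b = fit_line(xs, ys)
smoothed = [a + b * x for x in xs]
```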
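Outlier analysis (technique 3 above) can be sketched by measuring each value's distance to its nearest cluster center. The centers and the distance threshold are assumed inputs here; in practice they would come from a clustering algorithm such as k-means.

```python
def flag_outliers(values, centers, max_dist):
    """Flag values farther than max_dist from every cluster center."""
    return [v for v in values if min(abs(v - c) for c in centers) > max_dist]

# Assumed 1-D data forming two clusters near 10 and 100; the value 50
# falls outside both clusters and is flagged as an outlier.
data = [9, 10, 11, 50, 98, 100, 102]
print(flag_outliers(data, centers=[10, 100], max_dist=5))  # → [50]
```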
II. Data Integration
• The merging of data from multiple data stores.
• This process involves identifying and accessing the different data
sources and mapping the data to a common format.
• Careful integration can help reduce and avoid redundancies and
inconsistencies in the resulting data set.
• This can help improve the accuracy and speed of the subsequent
data mining process.
III. Data Reduction
• Data reduction is a technique used in data mining to reduce
the size of a dataset while still preserving the most important
information.
• That is, mining on the reduced data set should be more
efficient yet produce the same (or almost the same) analytical
results.
Overview of Data Reduction Strategies
• Data reduction strategies include dimensionality reduction,
numerosity reduction, and data compression.
• Dimensionality reduction is the process of reducing the
number of random variables or attributes under consideration.
• Attribute subset selection is a method of dimensionality
reduction in which irrelevant, weakly relevant, or redundant
attributes or dimensions are detected and removed.
Overview of Data Reduction Strategies
• Numerosity reduction techniques replace the original data
volume by alternative, smaller forms of data representation.
• These techniques may be parametric or nonparametric.
• Regression is an example of a parametric method. Nonparametric
methods for storing reduced representations of the data include
histograms, clustering, sampling, and data cube aggregation.
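Sampling, one of the nonparametric techniques listed above, can be sketched with the standard library. The helper name `srswor` and the fixed seed (used only so the sketch is reproducible) are assumptions.

```python
import random

def srswor(data, n, seed=42):
    """Simple random sample without replacement (SRSWOR) of n tuples,
    representing the full data set by a much smaller subset."""
    return random.Random(seed).sample(data, n)

# Reduce an assumed population of 1000 tuples to a sample of 100.
population = list(range(1000))
sample = srswor(population, 100)
print(len(sample))
```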
Overview of Data Reduction Strategies
• In data compression, transformations are applied so as to
obtain a reduced or “compressed” representation of the
original data.
• If the original data can be reconstructed from the compressed
data without any information loss, the data reduction is called
lossless.
• If, instead, we can reconstruct only an approximation of the
original data, then the data reduction is called lossy.
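The lossless case can be demonstrated with the standard-library `zlib` module; the sample byte string is an assumption chosen to be highly redundant so the compression is visible.

```python
import zlib

original = b"AAAAABBBBBCCCCC" * 200        # redundant sample data
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# Lossless: the original is reconstructed exactly, byte for byte.
assert restored == original
print(len(original), len(compressed))      # compressed form is much smaller
```

Lossy methods (e.g., wavelet transforms or PCA on numeric data) trade this exact reconstruction for greater reduction.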
IV. Data Transformation and Data Discretization
• In this preprocessing step, the data are transformed or
consolidated so that the resulting mining process may be more
efficient, and the patterns found may be easier to understand.
Data discretization is a form of data transformation.
• In data transformation, the data are transformed or
consolidated into forms appropriate for mining. Strategies for
data transformation include the following:
Data Transformation Strategies Overview
1. Smoothing, which removes noise from the data.
• Techniques include binning, regression, and clustering.
2. Attribute construction (or feature construction), where new
attributes are constructed and added from the given set of
attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are
applied to the data. For example, the daily sales data may be
aggregated so as to compute monthly and annual total
amounts.
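The daily-to-monthly aggregation in strategy 3 can be sketched with a dictionary keyed on the month. The sample records and the "YYYY-MM" group key are illustrative assumptions.

```python
from collections import defaultdict

# Assumed daily sales records as (ISO date, amount) pairs.
daily_sales = [("2023-01-05", 120.0), ("2023-01-20", 80.0), ("2023-02-03", 50.0)]

monthly_totals = defaultdict(float)
for date, amount in daily_sales:
    monthly_totals[date[:7]] += amount     # "YYYY-MM" is the group key

print(dict(monthly_totals))  # → {'2023-01': 200.0, '2023-02': 50.0}
```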
4. Normalization, where the attribute data are scaled so as to fall
within a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute
(e.g., age) are replaced by interval labels (e.g., 0–10, 11–20,
etc.) or conceptual labels (e.g., youth, adult, senior).
6. Concept hierarchy generation for nominal data, where
attributes such as street can be generalized to higher-level
concepts, like city or country.
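Strategy 4 (normalization) can be sketched as min-max normalization, one common scaling method; the function name and sample values are assumptions.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

print(min_max_normalize([20, 30, 40]))             # → [0.0, 0.5, 1.0]
print(min_max_normalize([20, 30, 40], -1.0, 1.0))  # → [-1.0, 0.0, 1.0]
```

Z-score normalization (subtract the mean, divide by the standard deviation) is an alternative when minimum and maximum are unknown or outliers dominate.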
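Strategy 5 (discretization) can be sketched by mapping a numeric age onto interval labels like those above; the cut points in `edges` are assumed for illustration.

```python
import bisect

def interval_label(age, edges=(10, 20, 30, 40, 50, 60)):
    """Replace a raw numeric age by an interval label such as '11-20'.
    The cut points in edges are assumed, not prescribed."""
    i = bisect.bisect_left(edges, age)
    lo = 0 if i == 0 else edges[i - 1] + 1
    return f"{lo}-{edges[i]}" if i < len(edges) else f"{lo}+"

print(interval_label(15))  # → '11-20'
print(interval_label(65))  # → '61+'
```

Conceptual labels (youth, adult, senior) work the same way with coarser cut points.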
• Discretization techniques can be categorized based on how the
discretization is performed, such as whether it uses class
information or which direction it proceeds (i.e., top-down vs.
bottom-up).
• If the discretization process uses class information, then we say
it is supervised discretization. Otherwise, it is unsupervised.
• Data discretization and concept hierarchy generation are also
forms of data reduction.