Major Tasks in Data Preprocessing
• Data preprocessing techniques can improve data quality, thereby
helping to improve the accuracy and efficiency of the subsequent
mining process.
• The major steps involved in data preprocessing are:
– data cleaning,
– data integration,
– data reduction, and
– data transformation.
Arya J S 1
I. Data Cleaning
• Data cleaning (or data cleansing) routines attempt to fill in missing
values, smooth out noise while identifying outliers, and correct
inconsistencies in the data.
1. Ways of handling missing values:
A. Ignore the tuple.
B. Fill in the missing value manually.
C. Use a global constant to fill in the missing value, such as “Unknown” or ∞.
D. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value.
E. Use the attribute mean or median for all samples belonging to the same class as the given tuple.
F. Use the most probable value to fill in the missing value.
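Strategy D can be sketched in a few lines of Python. This is an illustrative sketch only: the helper name `fill_missing_with_mean` and the use of `None` to mark missing entries are assumptions, not part of any particular library.

```python
from statistics import mean

def fill_missing_with_mean(values):
    """Strategy D: fill missing entries (marked here as None) with the
    mean of the observed values for the attribute."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Assumed sample attribute values with two missing entries.
ages = [25, None, 35, 40, None]
print(fill_missing_with_mean(ages))
```

The same pattern works for the median (swap in `statistics.median`), and strategy E simply applies it separately within each class.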
2. Noisy Data
• Noise is a random error or variance in a measured variable.
• The following data smoothing techniques can be used to remove
noise:
1. Binning: Binning methods smooth a sorted data value by
consulting its “neighborhood,” that is, the values around it.
The sorted values are distributed into a number of “buckets,” or
bins.
[Figure: binning example]
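Equal-frequency binning with smoothing by bin means can be sketched as follows. The price values and the helper name are illustrative assumptions; smoothing by bin boundaries would instead replace each value by the closest of the bin's minimum and maximum.

```python
def smooth_by_bin_means(values, bin_size):
    """Equal-frequency binning: sort the values, partition them into
    consecutive bins of bin_size, and replace each value by its bin mean."""
    data = sorted(values)
    smoothed = []
    for i in range(0, len(data), bin_size):
        bin_vals = data[i:i + bin_size]
        bin_mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([bin_mean] * len(bin_vals))
    return smoothed

# Assumed sample price data: 9 sorted values, partitioned into 3 bins of 3.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# → [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```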
2. Regression: Data smoothing can also be done by regression, a
technique that conforms data values to a function.
– Linear regression involves finding the “best” line to fit two attributes (or
variables) so that one attribute can be used to predict the other.
– Multiple linear regression is an extension of linear regression, where
more than two attributes are involved.
3. Outlier analysis: Outliers may be detected by clustering, for
example, where similar values are organized into groups, or
“clusters.” Intuitively, values that fall outside of the set of clusters
may be considered outliers.
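Smoothing by linear regression (technique 2 above) can be sketched in pure Python with a least-squares fit; the function name `fit_line` and the sample measurements are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x (simple linear regression)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# Assumed noisy measurements; smoothing replaces each y by its value on the line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]
a, b = fit_line(xs, ys)
smoothed = [a + b * x for x in xs]
```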
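Outlier analysis (technique 3 above) can be sketched by measuring each value's distance to its nearest cluster center. The centers and the distance threshold are assumed inputs here; in practice they would come from a clustering algorithm such as k-means.

```python
def flag_outliers(values, centers, max_dist):
    """Flag values farther than max_dist from every cluster center."""
    return [v for v in values if min(abs(v - c) for c in centers) > max_dist]

# Assumed 1-D data forming two clusters near 10 and 100; the value 50
# falls outside both clusters and is flagged as an outlier.
data = [9, 10, 11, 50, 98, 100, 102]
print(flag_outliers(data, centers=[10, 100], max_dist=5))  # → [50]
```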
II. Data Integration
• The merging of data from multiple data stores.
• This process involves identifying and accessing the different data
sources and mapping the data to a common format.
• Careful integration can help reduce and avoid redundancies and
inconsistencies in the resulting data set.
• This can help improve the accuracy and speed of the subsequent
data mining process.
III. Data Reduction
• Data reduction is a technique used in data mining to reduce
the size of a dataset while still preserving the most important
information.
• That is, mining on the reduced data set should be more
efficient yet produce the same (or almost the same) analytical
results.
Overview of Data Reduction Strategies
• Data reduction strategies include dimensionality reduction,
numerosity reduction, and data compression.
• Dimensionality reduction is the process of reducing the
number of random variables or attributes under consideration.
• Attribute subset selection is a method of dimensionality
reduction in which irrelevant, weakly relevant, or redundant
attributes or dimensions are detected and removed.
Overview of Data Reduction Strategies
• Numerosity reduction techniques replace the original data
volume by alternative, smaller forms of data representation.
• These techniques may be parametric or nonparametric.
• Regression is an example of a parametric method. Nonparametric
methods for storing reduced representations of the data include
histograms, clustering, sampling, and data cube aggregation.
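Sampling, one of the nonparametric techniques listed above, can be sketched with the standard library. The helper name `srswor` and the fixed seed (used only so the sketch is reproducible) are assumptions.

```python
import random

def srswor(data, n, seed=42):
    """Simple random sample without replacement (SRSWOR) of n tuples,
    representing the full data set by a much smaller subset."""
    return random.Random(seed).sample(data, n)

# Reduce an assumed population of 1000 tuples to a sample of 100.
population = list(range(1000))
sample = srswor(population, 100)
print(len(sample))
```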
Overview of Data Reduction Strategies
• In data compression, transformations are applied so as to
obtain a reduced or “compressed” representation of the
original data.
• If the original data can be reconstructed from the compressed
data without any information loss, the data reduction is called
lossless.
• If, instead, we can reconstruct only an approximation of the
original data, then the data reduction is called lossy.
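The lossless case can be demonstrated with the standard-library `zlib` module; the sample byte string is an assumption chosen to be highly redundant so the compression is visible.

```python
import zlib

original = b"AAAAABBBBBCCCCC" * 200        # redundant sample data
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# Lossless: the original is reconstructed exactly, byte for byte.
assert restored == original
print(len(original), len(compressed))      # compressed form is much smaller
```

Lossy methods (e.g., wavelet transforms or PCA on numeric data) trade this exact reconstruction for greater reduction.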
IV. Data Transformation and Data Discretization
• In this preprocessing step, the data are transformed or
consolidated so that the resulting mining process may be more
efficient, and the patterns found may be easier to understand.
Data discretization is a form of data transformation.
• In data transformation, the data are transformed or
consolidated into forms appropriate for mining. Strategies for
data transformation include the following:
Data Transformation Strategies Overview
1. Smoothing, which removes noise from the data.
• Techniques include binning, regression, and clustering.
2. Attribute construction (or feature construction), where new
attributes are constructed and added from the given set of
attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are
applied to the data. For example, the daily sales data may be
aggregated so as to compute monthly and annual total
amounts.
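The daily-to-monthly aggregation in strategy 3 can be sketched with a dictionary keyed on the month. The sample records and the "YYYY-MM" group key are illustrative assumptions.

```python
from collections import defaultdict

# Assumed daily sales records as (ISO date, amount) pairs.
daily_sales = [("2023-01-05", 120.0), ("2023-01-20", 80.0), ("2023-02-03", 50.0)]

monthly_totals = defaultdict(float)
for date, amount in daily_sales:
    monthly_totals[date[:7]] += amount     # "YYYY-MM" is the group key

print(dict(monthly_totals))  # → {'2023-01': 200.0, '2023-02': 50.0}
```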
4. Normalization, where the attribute data are scaled so as to fall
within a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute
(e.g., age) are replaced by interval labels (e.g., 0–10, 11–20,
etc.) or conceptual labels (e.g., youth, adult, senior).
6. Concept hierarchy generation for nominal data, where
attributes such as street can be generalized to higher-level
concepts, like city or country.
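Strategy 4 (normalization) can be sketched as min-max normalization, one common scaling method; the function name and sample values are assumptions.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

print(min_max_normalize([20, 30, 40]))             # → [0.0, 0.5, 1.0]
print(min_max_normalize([20, 30, 40], -1.0, 1.0))  # → [-1.0, 0.0, 1.0]
```

Z-score normalization (subtract the mean, divide by the standard deviation) is an alternative when minimum and maximum are unknown or outliers dominate.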
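Strategy 5 (discretization) can be sketched by mapping a numeric age onto interval labels like those above; the cut points in `edges` are assumed for illustration.

```python
import bisect

def interval_label(age, edges=(10, 20, 30, 40, 50, 60)):
    """Replace a raw numeric age by an interval label such as '11-20'.
    The cut points in edges are assumed, not prescribed."""
    i = bisect.bisect_left(edges, age)
    lo = 0 if i == 0 else edges[i - 1] + 1
    return f"{lo}-{edges[i]}" if i < len(edges) else f"{lo}+"

print(interval_label(15))  # → '11-20'
print(interval_label(65))  # → '61+'
```

Conceptual labels (youth, adult, senior) work the same way with coarser cut points.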
• Discretization techniques can be categorized based on how the
discretization is performed, such as whether it uses class
information or which direction it proceeds (i.e., top-down vs.
bottom-up).
• If the discretization process uses class information, then we say
it is supervised discretization. Otherwise, it is unsupervised.
• Data discretization and concept hierarchy generation are also
forms of data reduction.