Data Preprocessing (Sagar)

Uploaded by

Shubham Singh Rajput

Name: Sagar Kumar
Roll no.: 40822028
B.Sc. Data Analytics

1
Data Preprocessing

• Data Preprocessing: An Overview
  - Data Quality
  - Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary
2
Data Quality: Why Preprocess the Data?

• Measures for data quality: a multidimensional view
  - Accuracy: are the values correct or wrong?
  - Completeness: are values not recorded or unavailable?
  - Consistency: were some values modified but not others? Are there dangling references?
  - Timeliness: is the data updated in time?
  - Believability: how far can the data be trusted to be correct?
  - Interpretability: how easily can the data be understood?
3
Major Tasks in Data Preprocessing
• Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  - Integration of multiple databases, data cubes, or files
• Data reduction
  - Dimensionality reduction
  - Numerosity reduction
  - Data compression
• Data transformation and data discretization
  - Normalization
  - Concept hierarchy generation
4
Data Cleaning
• Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
  - Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
      e.g., Occupation=“ ” (missing data)
  - Noisy: containing noise, errors, or outliers
      e.g., Salary=“−10” (an error)
  - Inconsistent: containing discrepancies in codes or names, e.g.,
      Age=“42”, Birthday=“03/07/2010”
      Was rating “1, 2, 3”, now rating “A, B, C”
      Discrepancy between duplicate records
  - Intentional (e.g., disguised missing data)
      Jan. 1 as everyone’s birthday?
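
The three kinds of dirty data listed above can be flagged with simple rule checks. The sketch below is only an illustration: the record layout, the field names, and the fixed reference date are assumptions made for the example, and the slide's birthday "03/07/2010" is read here as 3 July 2010.

```python
from datetime import date

# Hypothetical records mirroring the slide's examples (field names are assumed).
records = [
    {"occupation": "", "salary": 52000, "age": 42, "birthday": date(2010, 7, 3)},
    {"occupation": "clerk", "salary": -10, "age": 30, "birthday": date(1994, 1, 1)},
    {"occupation": "nurse", "salary": 61000, "age": 29, "birthday": date(1994, 6, 15)},
]

# Assumed "today", fixed so that the age check gives the same result on every run.
REFERENCE_DATE = date(2024, 1, 1)


def age_on(birthday, today):
    """Number of full years between birthday and today."""
    had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
    return today.year - birthday.year - (0 if had_birthday else 1)


def find_problems(record, today=REFERENCE_DATE):
    """Return a list of data-quality problems found in one record."""
    problems = []
    if not record["occupation"].strip():
        problems.append("incomplete: occupation missing")
    if record["salary"] < 0:
        problems.append("noisy: negative salary")
    if age_on(record["birthday"], today) != record["age"]:
        problems.append("inconsistent: age does not match birthday")
    return problems


report = [find_problems(r) for r in records]
for record, problems in zip(records, report):
    print(record["occupation"] or "<missing>", problems)
```

Disguised missing data (such as Jan. 1 entered as a default birthday) cannot be caught by a per-record rule like these; it usually shows up only as an implausible spike in the value distribution.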
6
Incomplete (Missing) Data
• Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
  - equipment malfunction
  - data that was inconsistent with other recorded data and thus deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - history or changes of the data not being registered
• Missing data may need to be inferred
7
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably
• Fill in the missing value manually: tedious and often infeasible
• Fill it in automatically with
  - a global constant, e.g., “unknown” (which may end up acting as a new class)
  - the attribute mean
  - the attribute mean for all samples belonging to the same class: smarter
  - the most probable value: inference-based, such as a Bayesian formula or a decision tree
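
Two of the automatic fill-in options above, the attribute mean and the per-class attribute mean, can be sketched as follows. The class labels and income values are made up for illustration, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical tuples: (class label, income); None marks a missing value.
rows = [
    ("low_risk", 50.0),
    ("low_risk", None),
    ("low_risk", 70.0),
    ("high_risk", 20.0),
    ("high_risk", None),
    ("high_risk", 30.0),
]


def fill_with_attribute_mean(rows):
    """Replace each missing income with the mean of all known incomes."""
    overall = mean(v for _, v in rows if v is not None)
    return [(c, overall if v is None else v) for c, v in rows]


def fill_with_class_mean(rows):
    """Replace each missing income with the mean income of the same class."""
    known = {}
    for c, v in rows:
        if v is not None:
            known.setdefault(c, []).append(v)
    class_mean = {c: mean(vs) for c, vs in known.items()}
    # Assumes every class has at least one known value; otherwise this raises KeyError.
    return [(c, class_mean[c] if v is None else v) for c, v in rows]


global_filled = fill_with_attribute_mean(rows)
class_filled = fill_with_class_mean(rows)
print(global_filled[1], class_filled[1], class_filled[4])
```

With this data the overall mean is 42.5, while the class means are 60 and 25, which shows why the class-conditional fill is described as the smarter choice.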
8
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitation
  - inconsistency in naming convention
• Other data problems which require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data
9
How to Handle Noisy Data?
• Binning
  - first sort the data and partition it into (equal-frequency) bins
  - then smooth by bin means, bin medians, bin boundaries, etc.
• Regression
  - smooth by fitting the data to regression functions
• Clustering
  - detect and remove outliers
• Combined computer and human inspection
  - detect suspicious values and have a human check them (e.g., to deal with possible outliers)
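
The binning approach above can be sketched in a few lines. The price list is illustrative data chosen for the example, and the helper names are hypothetical; the sketch assumes there are at least as many values as bins.

```python
from statistics import mean


def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative values
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(bins)
print(by_means)
print(by_boundaries)
```

Smoothing by bin medians follows the same pattern with `statistics.median` in place of `mean`.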

10
Data Cleaning as a Process
• Data discrepancy detection
  - Use metadata (e.g., domain, range, dependency, distribution)
  - Check field overloading
  - Check uniqueness rule, consecutive rule, and null rule
  - Use commercial tools
      Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
      Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
  - Data migration tools: allow transformations to be specified
  - ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
• Integration of the two processes
  - Iterative and interactive (e.g., Potter’s Wheel)

11
Data Integration
• Data integration:
  - Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
  - Integrate metadata from different sources
• Entity identification problem:
  - Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
  - For the same real-world entity, attribute values from different sources are different
  - Possible reasons: different representations, different scales, e.g., metric vs. British units
13
Handling Redundancy in Data Integration

• Redundant data often occur when integrating multiple databases
  - Object identification: the same attribute or object may have different names in different databases
  - Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant attributes may be detected by correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality
14
Chi-Square Calculation: An Example

                            Play chess   Not play chess   Sum (row)
Like science fiction        250 (90)     200 (360)        450
Not like science fiction    50 (210)     1000 (840)       1050
Sum (col.)                  300          1200             1500

• χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories)

$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$

• It shows that like_science_fiction and play_chess are correlated in the group
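
The calculation above can be checked with a short script. This is a minimal sketch that derives the expected counts from the row and column sums of the observed table; the function name is hypothetical.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table given as a list of rows.

    Returns the statistic and the table of expected counts.
    """
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    stat = 0.0
    expected = []
    for i, row in enumerate(observed):
        expected_row = []
        for j, count in enumerate(row):
            e = row_sums[i] * col_sums[j] / total  # expected count under independence
            expected_row.append(e)
            stat += (count - e) ** 2 / e
        expected.append(expected_row)
    return stat, expected


# Rows: like / not like science fiction; columns: play / not play chess.
observed = [[250, 200], [50, 1000]]
stat, expected = chi_square(observed)
print(expected)
print(round(stat, 2))
```

The expected counts come out as 90, 360, 210, and 840, matching the numbers in parentheses in the table.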
15
Visually Evaluating Correlation

[Figure: scatter plots showing the similarity from –1 to 1.]

16
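Correlation (viewed as linear relationship)
• Correlation measures the linear relationship between objects
• To compute correlation, we standardize data objects, A and B, and then take their dot product

$$a'_k = \frac{a_k - \mathrm{mean}(A)}{\mathrm{std}(A)}$$

$$b'_k = \frac{b_k - \mathrm{mean}(B)}{\mathrm{std}(B)}$$

$$\mathrm{correlation}(A, B) = A' \cdot B'$$

For the result to lie in [–1, 1], the dot product is divided by n when std is the population standard deviation (or by n – 1 when it is the sample standard deviation).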
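
A minimal sketch of the standardize-then-dot-product recipe follows. It assumes the population standard deviation, so the dot product is divided by n to keep the result between –1 and 1; the two value lists are illustrative.

```python
from math import sqrt


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = sqrt(sum((v - mu) ** 2 for v in values) / n)
    return [(v - mu) / sigma for v in values]


def correlation(a, b):
    """Dot product of the standardized vectors, scaled by 1/n."""
    a_std, b_std = standardize(a), standardize(b)
    dot = sum(x * y for x, y in zip(a_std, b_std))
    return dot / len(a)


A = [2, 3, 5, 4, 6]       # illustrative values
B = [5, 8, 10, 11, 14]    # illustrative values
r = correlation(A, B)
print(round(r, 4))
```

A constant attribute has a standard deviation of 0, so this sketch would fail with a division by zero for such input.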

17
Covariance (Numeric Data)
• Covariance is similar to correlation

$$\mathrm{Cov}(A, B) = E\big[(A - \bar{A})(B - \bar{B})\big] = \frac{1}{n}\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})$$

Correlation coefficient:

$$r_{A,B} = \frac{\mathrm{Cov}(A, B)}{\sigma_A \sigma_B}$$

where n is the number of tuples, \bar{A} and \bar{B} are the respective means or expected values of A and B, and σ_A and σ_B are the respective standard deviations of A and B.

• Positive covariance: if Cov(A,B) > 0, then A and B both tend to be larger than their expected values.
• Negative covariance: if Cov(A,B) < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value.
• Independence: if A and B are independent, then Cov(A,B) = 0, but the converse is not true:
  - Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence
18
Covariance: An Example

• It can be simplified in computation as

$$\mathrm{Cov}(A, B) = E(A \cdot B) - \bar{A}\bar{B}$$

where \bar{A} = E(A) and \bar{B} = E(B).

• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?
  - E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
  - E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
  - Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 4
• Thus, A and B rise together since Cov(A, B) > 0.
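
The stock example can be verified numerically. The sketch below computes the covariance twice, once as the mean product of deviations and once with the shortcut E(A·B) minus the product of the means; the function names are hypothetical.

```python
def covariance_definition(a, b):
    """Mean of the products of deviations from the means (divides by n)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


def covariance_shortcut(a, b):
    """E(A*B) minus the product of the means; algebraically the same value."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b


pairs = [(2, 5), (3, 8), (5, 10), (4, 11), (6, 14)]  # (A, B) values from the example
A = [p[0] for p in pairs]
B = [p[1] for p in pairs]

cov_def = covariance_definition(A, B)
cov_short = covariance_shortcut(A, B)
print(round(cov_def, 6), round(cov_short, 6))
```

Both forms give 4 up to floating-point rounding, so the positive sign supports the conclusion that the two stocks tend to rise together.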
Data Reduction Strategies
• Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
• Why data reduction? A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set.
• Data reduction strategies
  - Dimensionality reduction, e.g., remove unimportant attributes
      Wavelet transforms
      Principal Components Analysis (PCA)
      Feature subset selection, feature creation
  - Numerosity reduction (some simply call it: data reduction)
      Regression and log-linear models
      Histograms, clustering, sampling
      Data cube aggregation
  - Data compression
20
Data Reduction 1: Dimensionality Reduction
• Curse of dimensionality
  - When dimensionality increases, data becomes increasingly sparse
  - Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
  - The possible combinations of subspaces grow exponentially
• Dimensionality reduction
  - Avoid the curse of dimensionality
  - Help eliminate irrelevant features and reduce noise
  - Reduce time and space required in data mining
  - Allow easier visualization
• Dimensionality reduction techniques
  - Wavelet transforms
  - Principal Component Analysis
  - Supervised and nonlinear techniques (e.g., feature selection)

21
Mapping Data to a New Space
• Fourier transform
• Wavelet transform

[Figure panels: Two Sine Waves; Two Sine Waves + Noise; Frequency]

22
What Is Wavelet Transform?
• Decomposes a signal into different frequency subbands
  - Applicable to n-dimensional signals
• Data are transformed to preserve relative distance between objects at different levels of resolution
• Allows natural clusters to become more distinguishable
• Used for image compression

23
Wavelet Transformation

[Figure: Haar-2 and Daubechies-4 wavelets]

• Discrete wavelet transform (DWT) for linear signal processing, multi-resolution analysis
• Compressed approximation: store only a small fraction of the strongest wavelet coefficients
• Similar to discrete Fourier transform (DFT), but better lossy compression, localized in space
• Method:
  - Length, L, must be an integer power of 2 (padding with 0’s when necessary)
  - Each transform has 2 functions: smoothing and difference
  - Applies to pairs of data, resulting in two sets of data of length L/2
  - Applies the two functions recursively until the desired length is reached

24
Wavelet Decomposition
• Wavelets: a math tool for space-efficient hierarchical decomposition of functions
• S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to Ŝ = [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
• Compression: many small detail coefficients can be replaced by 0’s, and only the significant coefficients are retained
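
The decomposition of S can be reproduced with pairwise averaging (smoothing) and differencing. This is a minimal sketch of the unnormalized Haar transform in that averaging form; the function names and the threshold value are assumptions for illustration.

```python
def haar_decompose(signal):
    """Haar decomposition by repeated pairwise averaging and differencing.

    Returns [overall average, coarsest detail, ..., finest details].
    The input length must be a power of 2.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2 (pad with zeros first)")
    approx = [float(v) for v in signal]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]  # difference function
        approx = [(a + b) / 2 for a, b in pairs]         # smoothing function
        details = level_details + details                # coarser levels go in front
    return approx + details


def compress(coefficients, threshold):
    """Keep the overall average; zero out detail coefficients below the threshold."""
    head, rest = coefficients[:1], coefficients[1:]
    return head + [c if abs(c) >= threshold else 0.0 for c in rest]


S = [2, 2, 0, 2, 3, 5, 4, 4]
S_hat = haar_decompose(S)
print(S_hat)
print(compress(S_hat, 1.0))
```

Each pass halves the working length, so a signal of length 8 needs three passes, and the total work is proportional to the signal length.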

25
Haar Wavelet Coefficients

[Figure: hierarchical decomposition structure (a.k.a. “error tree”) showing each coefficient and its “supports”. Coefficients: 2.75, -1.25, 0.5, 0, 0, -1, -1, 0. Original frequency distribution: 2 2 0 2 3 5 4 4.]
26
Why Wavelet Transform?
• Use hat-shape filters
  - Emphasize region where points cluster
  - Suppress weaker information in their boundaries
• Effective removal of outliers
  - Insensitive to noise, insensitive to input order
• Multi-resolution
  - Detect arbitrary shaped clusters at different scales
• Efficient
  - Complexity O(N)
• Only applicable to low dimensional data

27
Principal Component Analysis (PCA)
• Find a projection that captures the largest amount of variation in data
• The original data are projected onto a much smaller space, resulting in dimensionality reduction. We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space
28
Principal Component Analysis (Steps)
• Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
  - Normalize input data: each attribute falls within the same range
  - Compute k orthonormal (unit) vectors, i.e., principal components
  - Each input data vector is a linear combination of the k principal component vectors
  - The principal components are sorted in order of decreasing “significance” or strength
  - Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
• Works for numeric data only
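
The steps above can be illustrated for the strongest component only. The sketch below is a simplification under stated assumptions: it centers the data but does not rescale the attributes to a common range, it uses power iteration (a swapped-in technique, not named in the slides) to find the leading eigenvector of the covariance matrix, and the data points are illustrative.

```python
from math import sqrt


def covariance_matrix(rows):
    """Population covariance matrix of the columns, plus the means and centered rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
           for i in range(d)]
    return cov, means, centered


def first_principal_component(cov, iterations=200):
    """Power iteration: (largest eigenvalue, unit eigenvector) of a covariance matrix.

    Assumes the start vector is not orthogonal to the leading eigenvector
    and that the matrix is not all zeros.
    """
    d = len(cov)
    v = [1.0 / sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return eigenvalue, v


data = [(2, 5), (3, 8), (5, 10), (4, 11), (6, 14)]  # illustrative 2-D points
cov, means, centered = covariance_matrix(data)
eigenvalue, direction = first_principal_component(cov)

# Project each centered point onto the first principal component (2-D to 1-D).
projected = [sum(c[j] * direction[j] for j in range(len(direction))) for c in centered]
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
print(round(eigenvalue, 3), round(explained, 3))
```

For this data a single component retains roughly 98% of the total variance, which is the sense in which the weak component can be dropped with little loss.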

29
Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
  - Duplicate much or all of the information contained in one or more other attributes
  - E.g., purchase price of a product and the amount of sales tax paid
• Irrelevant attributes
  - Contain no information that is useful for the data mining task at hand
  - E.g., students' ID is often irrelevant to the task of predicting students' GPA

30
Sampling: With or without Replacement

[Figure: Raw Data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement).]
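
The two sampling schemes can be sketched with Python's standard `random` module. The tuple IDs, the sample size, and the seed are assumptions made for the example.

```python
import random

raw_data = list(range(1, 21))  # hypothetical tuple IDs 1..20
rng = random.Random(42)        # fixed seed so the run is repeatable

# SRSWOR: each tuple can be drawn at most once.
srswor = rng.sample(raw_data, k=5)

# SRSWR: a drawn tuple is put back, so it may be drawn again.
srswr = rng.choices(raw_data, k=5)

print("SRSWOR:", srswor)
print("SRSWR: ", srswr)
```

Without replacement the sample can never be larger than the data set, whereas with replacement any sample size is possible and duplicates may appear.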
31
