Data Mining

The document discusses various techniques for preprocessing data in order to improve its quality for data mining purposes, including data cleaning techniques to handle missing values, noisy data, and inconsistencies. It also covers data integration topics such as resolving semantic heterogeneity across multiple data sources and analyzing and removing redundancy. The overall goal of data preprocessing is to produce a reduced and cleaner representation of the data to improve the effectiveness of subsequent data mining and analysis.

Uploaded by

mohamedelgohary679

Data Mining and Business Intelligence

Data Pre-processing
• Overview
• Data Cleaning
• Integration

By
Dr. Nora Shoaip

Lecture 3

Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems

2023 - 2024
Quiz

Draw the Box-Plot for the following dataset

4.3, 5.1, 3.9, 4.5, 4.4, 4.9, 5.0, 4.7, 4.1, 4.6, 4.4, 4.3, 4.8,
4.4, 4.2, 4.5, 4.4
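
The values a box plot needs for this dataset can be computed as in the sketch below. It assumes the "inclusive" quartile convention of Python's `statistics.quantiles` and the common 1.5 × IQR rule for whiskers and outliers; neither convention is stated on the slide, and other quartile conventions can give slightly different numbers.

```python
from statistics import quantiles

data = [4.3, 5.1, 3.9, 4.5, 4.4, 4.9, 5.0, 4.7, 4.1, 4.6, 4.4, 4.3, 4.8,
        4.4, 4.2, 4.5, 4.4]

# Quartiles (assumption: "inclusive" interpolation convention).
q1, med, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences for the 1.5 * IQR rule (assumption: the usual box-plot convention).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
inside = [x for x in data if lower_fence <= x <= upper_fence]

# Whiskers extend to the most extreme values that are not outliers.
whisker_low, whisker_high = min(inside), max(inside)

print(f"Q1={q1:.2f} median={med:.2f} Q3={q3:.2f} IQR={iqr:.2f}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```

Under these assumptions the box spans Q1 to Q3 with a line at the median, and the whiskers reach the smallest and largest values because no point falls outside the fences.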

Overview

Databases are highly susceptible to noisy, missing, and inconsistent data
Low-quality data will lead to low-quality mining results

“How can the data be preprocessed in order to help improve the quality of the data and, consequently, of the mining results? How can the data be preprocessed so as to improve the efficiency and ease of the mining process?”

Why Preprocess Data?

To satisfy the requirements of the intended use

• Factors of data quality:
◦ Accuracy → lacking due to faulty instruments, errors caused by humans, computers, or transmission, deliberate errors …
◦ Completeness → lacking due to different design phases, optional attributes
◦ Consistency → lacking due to semantics, data types, field formats …
◦ Timeliness
◦ Believability → how much the data are trusted by users
◦ Interpretability → how easily the data are understood

Major Preprocessing Tasks
That Improve Quality of Data

• Data cleaning → filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies
• Data integration → include data from multiple sources in your analysis, map semantic concepts, infer attributes …
• Data reduction → obtain a reduced representation of the data set that is much smaller in volume, while producing almost the same analytical results
• Discretization → raw data values for attributes are replaced by ranges or higher conceptual levels
• Data transformation → normalization

Data Cleaning

• Data in the Real World Is Dirty!

◦ incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
   e.g., Occupation=“ ” (missing data)
◦ noisy: containing noise, errors, or outliers
   e.g., Salary=“−10” (an error)
◦ inconsistent: containing discrepancies in codes or names, e.g.,
   Age=“42”, Birthday=“03/07/2010”
   Was rating “1, 2, 3”, now rating “A, B, C”
   discrepancy between duplicate records
◦ intentional (disguised missing data) → e.g., Jan. 1 as everyone’s birthday?

Data Cleaning

… fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data

A missing value may not imply an error in the data!

◦ e.g., a driver’s license number field is legitimately blank for someone who has no license

Data Cleaning
Missing Values

• Ignore the tuple → not very effective, unless the tuple contains several attributes with missing values
• Fill in the missing value manually → time-consuming, not feasible for large data sets
• Use a global constant → replace all missing attribute values by the same value (e.g., “unknown”)
   → the mining program may mistakenly think that “unknown” is an interesting concept

Data Cleaning
Missing Values

• Use the mean or median → for normal (symmetric) data distributions the mean is used, while skewed data distributions should employ the median
• Use the mean or median of all samples belonging to the same class as the given tuple → e.g., the mean or median of customers in a certain age group
• Use the most probable value → determined with regression or inference-based tools such as a Bayesian formalism or decision tree induction
   → the most popular strategy

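
The mean/median strategies can be illustrated with a minimal sketch. The function names, the `None` marker for a missing value, and the sample numbers are all hypothetical and chosen only for illustration; they are not from the lecture.

```python
from statistics import mean, median

def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean or median of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known) if strategy == "mean" else median(known)
    return [fill if v is None else v for v in values]

def fill_missing_by_class(rows, strategy="mean"):
    """rows is a list of (class_label, value) pairs.
    A missing value is filled from the known values of the same class.
    Assumes every class has at least one known value."""
    agg = mean if strategy == "mean" else median
    known_by_class = {}
    for label, value in rows:
        if value is not None:
            known_by_class.setdefault(label, []).append(value)
    return [(label, agg(known_by_class[label]) if value is None else value)
            for label, value in rows]

# Hypothetical, skewed sample: one large value pulls the mean upward.
incomes = [30, 32, None, 35, 200]
print(fill_missing(incomes, "mean"))    # fills with 74.25
print(fill_missing(incomes, "median"))  # fills with 33.5

# Hypothetical class-wise sample (age group as the class).
rows = [("young", 30), ("young", None), ("young", 34),
        ("senior", 60), ("senior", None), ("senior", 70)]
filled_rows = fill_missing_by_class(rows)
print(filled_rows)
```

On the skewed sample the median fill stays close to the typical values while the mean fill does not, which is the reason the slide recommends the median for skewed distributions.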
Data Cleaning
Noisy Data

Noise is a random error or variance in a measured variable

Data smoothing techniques:

1. Binning
2. Regression
3. Outlier Analysis

Data Cleaning
Noisy Data

1. Binning → smooth a sorted data value by consulting its “neighborhood”
◦ sorted values are partitioned into a number of “buckets,” or bins → local smoothing
◦ equal-frequency bins → each bin has the same number of values
◦ equal-width bins → the interval range of values per bin is constant
• Smoothing by bin means → each bin value is replaced by the bin mean
• Smoothing by bin medians → each bin value is replaced by the bin median
• Smoothing by bin boundaries → each bin value is replaced by the closest boundary value (the min and max in a bin are the bin boundaries)

Data Cleaning
Noisy Data

Example: Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-frequency) bins
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by bin boundaries
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34

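
The equal-frequency example can be reproduced with a short sketch. The function names are hypothetical, and the rule that a value exactly halfway between the two boundaries goes to the lower one is an assumption, since the slide does not say how ties are broken (no tie occurs in this data).

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split sorted values into n_bins consecutive bins of equal size.
    Assumes len(sorted_values) is divisible by n_bins."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean (rounded to an integer)."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the nearer of the bin's min and max.
    Assumption: ties go to the lower boundary."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # each bin is already sorted
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
freq_bins = equal_frequency_bins(prices, 3)
print(freq_bins)
print(smooth_by_means(freq_bins))
print(smooth_by_boundaries(freq_bins))
```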
Data Cleaning
Noisy Data

Example: Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-width) bins
(these bins are consistent with intervals of width 15: 1 to 15, 16 to 30, 31 to 45)
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24, 25, 28
Bin 3: 34

Smoothing by bin means (rounded to the nearest dollar)
Bin 1: 9, 9, 9
Bin 2: 24, 24, 24, 24, 24
Bin 3: 34

Smoothing by bin boundaries
Bin 1: 4, 4, 15
Bin 2: 21, 21, 21, 28, 28
Bin 3: 34

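
A sketch of equal-width partitioning follows. The slide does not state the interval width or where the first interval starts; the values `width=15` and `origin=0` are assumptions chosen because they reproduce the bins shown. A different choice, such as a width of (max − min) / 3 starting at the minimum, would give different bins.

```python
import math

def equal_width_bins(values, width, origin=0):
    """Group values into right-closed intervals
    (origin + k*width, origin + (k+1)*width] and return the non-empty bins.
    Assumes every value is strictly greater than origin."""
    bins = {}
    for v in sorted(values):
        k = math.ceil((v - origin) / width) - 1
        bins.setdefault(k, []).append(v)
    return [bins[k] for k in sorted(bins)]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
width_bins = equal_width_bins(prices, width=15, origin=0)
print(width_bins)

# Bin means before rounding: the middle bin's mean is 23.8, shown as 24.
bin_means = [sum(b) / len(b) for b in width_bins]
print(bin_means)

# Boundary smoothing: each value moves to the nearer of the bin's min and max.
boundary_smoothed = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
                     for b in width_bins]
print(boundary_smoothed)
```

Unlike equal-frequency binning, the bins here hold 3, 5, and 1 values, so a single extreme value can end up alone in its bin.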
Data Cleaning
Noisy Data

2. Regression → conform data values to a function
◦ Linear regression → find the “best” line to fit two attributes so that one attribute can be used to predict the other
3. Outlier Analysis
• Potter’s Wheel → automated interactive data cleaning tool

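
Smoothing by linear regression can be sketched as follows, assuming an ordinary least-squares fit as the meaning of the "best" line. The function name and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept for paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical noisy measurements of an attribute y against an attribute x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)

# Smoothing: each observed y is replaced by the value on the fitted line.
smoothed = [slope * x + intercept for x in xs]
print(round(slope, 2), round(intercept, 2))
print([round(v, 2) for v in smoothed])
```

The same fitted line can also predict a missing y from a known x, which is how regression serves the "most probable value" strategy for missing values.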
Data Integration

• Entity identification problem
• Redundancy and correlation analysis
• Tuple duplication
• Data value conflict detection

Data Integration

Merging data from multiple data stores

Careful integration helps reduce and avoid redundancies and inconsistencies in the resulting data set
Challenges:
• Semantic heterogeneity → entity identification problem
• Structure of data → functional dependencies and referential constraints
• Redundancy

Data Integration
Entity Identification Problem

• Schema integration and object matching
• Metadata → name, meaning, data type, and range of values permitted for each attribute, plus null rules for handling blank, zero, or null values
   → metadata can help avoid errors in schema integration and data transformation

Data Integration
Redundancy and Correlation Analysis

Observed counts: preferred reading by gender

               male    female    Total
Fiction         250       200      450
Non-fiction      50      1000     1050
Total           300      1200     1500

Data Integration
Redundancy and Correlation Analysis

Observed counts, with expected counts in parentheses: preferred reading by gender

               male         female         Total
Fiction        250 (90)     200 (360)        450
Non-fiction    50 (210)     1000 (840)      1050
Total          300          1200            1500

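
The parenthesized values match the expected counts under independence (row total × column total / grand total), which suggests a χ² (chi-square) test of independence. The formula itself did not survive in the text, so the sketch below is an assumption about the intended analysis; the function name is hypothetical.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table of observed counts.
    Returns the statistic and the table of expected counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    return stat, expected

# Rows: fiction, non-fiction. Columns: male, female.
observed = [[250, 200],
            [50, 1000]]

stat, expected = chi_square(observed)
print(expected)        # [[90.0, 360.0], [210.0, 840.0]]
print(round(stat, 2))  # about 507.94

# A 2 x 2 table has (2-1)*(2-1) = 1 degree of freedom; the critical value
# at the 0.001 significance level is 10.828.
print(stat > 10.828)
```

Because the statistic is far above the critical value, the hypothesis that preferred reading and gender are independent would be rejected for these counts.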
Data Integration
Redundancy and Correlation Analysis

Time point    AllElectronics    HighTech
T1                  6               20
T2                  5               10
T3                  4               14
T4                  3                5
T5                  2                5

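
For two numeric attributes such as the columns above, redundancy is usually assessed with covariance and the correlation coefficient. The accompanying formulas did not survive in the text, so this sketch assumes the population forms (dividing by n); the function names are hypothetical.

```python
from math import sqrt

def covariance(xs, ys):
    """Population covariance: E(X*Y) - mean(X) * mean(Y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y

def correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of the
    population standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return covariance(xs, ys) / (sd_x * sd_y)

all_electronics = [6, 5, 4, 3, 2]
high_tech = [20, 10, 14, 5, 5]

cov = covariance(all_electronics, high_tech)
r = correlation(all_electronics, high_tech)
print(round(cov, 2))  # 7.0
print(round(r, 2))    # 0.87
```

A positive covariance means the two columns tend to rise and fall together; the correlation coefficient rescales it to the range −1 to 1 so the strength can be compared across attribute pairs.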
Data Integration
More Issues

Tuple duplication
The use of denormalized tables (often done to improve performance by avoiding joins) is another source of data redundancy.
e.g., a purchaser’s name and address stored alongside each purchase
Data value conflict
e.g., grading systems in two different institutes → A, B, … versus 90%, 80% …

Quiz
Age 23 23 27 27 39 41 47 49 50 52 54 54 56 57 58 58 60 61

%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7

• Calculate the correlation coefficient. Are these two attributes positively or negatively correlated? Compute their covariance.
◦ Hint: n = 18
◦ The SDs of age and %fat are 12.85 and 8.99, respectively
◦ The means of age and %fat are 46.44 and 28.78, respectively
◦ E(age × %fat) = 1431.29
• Partition the age data into three bins using both equal-frequency and equal-width partitioning
• Use smoothing by bin boundaries to smooth these data

Quiz: Solution
Age 23 23 27 27 39 41 47 49 50 52 54 54 56 57 58 58 60 61

%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7
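
The covariance and correlation follow directly from the quiz hints. The derivation below is a sketch that uses the rounded hint values and assumes the population forms, which is what the given SDs correspond to; results computed from the raw data differ slightly in the later decimals.

```latex
\begin{align*}
\operatorname{Cov}(\text{age}, \text{fat})
  &= E(\text{age} \cdot \text{fat}) - \overline{\text{age}} \cdot \overline{\text{fat}} \\
  &= 1431.29 - 46.44 \times 28.78 \\
  &\approx 1431.29 - 1336.54 = 94.75 \\[6pt]
r_{\text{age},\,\text{fat}}
  &= \frac{\operatorname{Cov}(\text{age}, \text{fat})}{\sigma_{\text{age}} \, \sigma_{\text{fat}}}
   = \frac{94.75}{12.85 \times 8.99}
   \approx \frac{94.75}{115.52} \approx 0.82
\end{align*}
```

Since the covariance and the correlation coefficient are both positive, age and %fat are positively correlated: %fat tends to increase with age.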

• Equal-frequency bins for age
◦ Bin 1 = 23, 23, 27, 27, 39, 41
◦ Bin 2 = 47, 49, 50, 52, 54, 54
◦ Bin 3 = 56, 57, 58, 58, 60, 61
• Smoothing by bin boundaries (equal-frequency bins)
◦ Bin 1 = 23, 23, 23, 23, 41, 41
◦ Bin 2 = 47, 47, 47, 54, 54, 54
◦ Bin 3 = 56, 56, 56, 56, 61, 61

• Equal-width bins for age
◦ Bin 1 = 23, 23, 27, 27
◦ Bin 2 = 39, 41, 47, 49
◦ Bin 3 = 50, 52, 54, 54, 56, 57, 58, 58, 60, 61
• Smoothing by bin boundaries (equal-width bins)
◦ Bin 1 = 23, 23, 27, 27
◦ Bin 2 = 39, 39, 49, 49
◦ Bin 3 = 50, 50, 50, 50, 61, 61, 61, 61, 61, 61
