
CSCI322 - Lecture 2

The document outlines the importance of data preprocessing in data analysis, detailing various tasks such as data cleaning, integration, reduction, and transformation. It emphasizes the need for data quality measures, including accuracy, completeness, consistency, and timeliness, to ensure reliable analysis. Additionally, it discusses methods for handling missing and noisy data, highlighting strategies like data imputation and the integration of multiple data sources.

Uploaded by

Noha Gamal Eldin

CSCI322: Data Analysis
Lecture 2: Data Preprocessing I
Dr. Noha Gamal, Dr. Mustafa Elattar, Dr. Mohamed Nagy

Data Preprocessing

• Data Quality
• Major Tasks in Data Pre-processing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Fusion
• Data Transformation and Data Discretization

Record ID | Employee ID | Name           | Age | Salary    | Height | Last updated | SVC
1         | 101         | John Doe       | 28  | $60,000   | 1.75   | Oct-2023     | 100.253
2         | 102         | Jane Smith     | 35  |           | 1.63   | Oct-2023     | 0.25
3         | 103         | Robert Johnson | 45  | $90,000   | 1.82   | Jan-2019     | 15.68
4         | 104         | Emily White    | -10 | $80,000   | 1.82   | Oct-2023     | 188.78
5         | 105         | Michael Brown  | 40  | $100,000  | 1.82   | Oct-2023     | 0
1         | 101         | Goerge Marco   | 28  | $6000,000 | 1.82   | Oct-2023     | x

Record ID | Employee ID | Name           | Age | Salary    | Height | Last updated | SVC     | Issue flagged
1         | 101         | John Doe       | 28  | $60,000   | 1.75   | Oct-2023     | 100.253 |
2         | 102         | Jane Smith     | 35  |           | 1.63   | Oct-2023     | 0.25    | UNAVAILABLE (salary)
3         | 103         | Robert Johnson | 45  | $90,000   | 1.82   | Jan-2019     | 15.68   | UNTIMELY (last updated)
4         | 104         | Emily White    | -10 | $80,000   | 1.82   | Oct-2023     | 188.78  | NOISY/WRONG, unbelievable (age)
5         | 105         | Michael Brown  | 40  | $100,000  | 1.82   | Oct-2023     | A       |
1         | 101         | Goerge Marco   | 28  | $6000,000 | 1.82   | Oct-2023     | X       | INCONSISTENT (discrepancy between duplicate records), INACCURATE (salary)

• Un-interpretable: the SVC column

Data Quality: Why Preprocess the Data?
• Measures for data quality: A multidimensional view
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, dangling, …
• Timeliness: timely update?
• Believability: how far can the data be trusted to be correct?
• Interpretability: how easily the data can be understood?
How can we enhance data quality, given that we do not have the luxury of dropping every problematic record?

Major Tasks in Data Preprocessing

• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation

Data Quality: What is Data Pre-processing?
• Data preprocessing is essentially the process of cleaning, integrating, transforming, and organizing raw data into a format suitable for analysis.
• It acts as the gateway between the raw, often messy, data we collect and the meaningful insights we seek to extract.
• Raw data is the unprocessed information collected directly from observations, surveys, or experiments. It's the starting point of our analysis, often messy and unstructured. Characteristics of raw data include its originality, lack of organization, and potential presence of errors or inconsistencies.

Major Tasks in Data Pre-processing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Example: In a students database, some entries in the "Attendance" column are missing. Cleaning involves deciding whether to fill in those missing values or remove the corresponding rows, ensuring accurate records of student attendance.

Student ID | Name    | Age | Attendance
001        | Alice   | 20  | 90%
002        | Bob     | 21  |
003        | Charlie | 19  | 85%

• Data integration
• Integration of multiple databases, data cubes, or files
• Example: Combining information from the student database with data from the library system to create a unified view of student profiles, including academic records and library borrowing history.

Student ID | Name    | Age | Attendance | Books_Borrowed
001        | Alice   | 20  | 90%        | 5
002        | Bob     | 21  |            | 3
003        | Charlie | 19  | 85%        | 2

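The cleaning and integration examples above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`student_id`, `attendance`, `books_borrowed`) are invented for the sketch, and filling the gap with the mean of the observed values is one possible choice, not the method the slide prescribes.

```python
from statistics import mean

# Student records from the slide; None marks Bob's missing attendance.
students = [
    {"student_id": "001", "name": "Alice", "age": 20, "attendance": 90.0},
    {"student_id": "002", "name": "Bob", "age": 21, "attendance": None},
    {"student_id": "003", "name": "Charlie", "age": 19, "attendance": 85.0},
]

# Library system keyed by student ID (books borrowed, as on the slide).
library = {"001": 5, "002": 3, "003": 2}

# Cleaning option 1: remove rows whose attendance is missing.
dropped = [s for s in students if s["attendance"] is not None]

# Cleaning option 2 (assumed strategy): fill the gap with the observed mean.
observed_mean = mean(s["attendance"] for s in students if s["attendance"] is not None)
filled = [
    {**s, "attendance": observed_mean if s["attendance"] is None else s["attendance"]}
    for s in students
]

# Integration: join the two sources on the shared student ID.
integrated = [{**s, "books_borrowed": library.get(s["student_id"])} for s in filled]

for row in integrated:
    print(row)
```

Dropping the row loses Bob's library history as well, which is why filling is often preferred when records are scarce.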
Major Tasks in Data Pre-processing
• Data reduction
Example: Consider a dataset with scores in multiple subjects for each student. Using dimensionality reduction techniques like PCA, you can represent the essential information with fewer dimensions. For instance, transforming scores in Math, English, Physics, and Chemistry into a single composite score.

Student ID | Math Score | English Score | Physics Score | Chemistry Score
001        | 90         | 85            | 88            | 92
002        | 78         | 92            | 85            | 80
003        | 85         | 88            | 90            | 78

• Dimensionality reduction: reducing the number of dimensions (scores in different subjects) while preserving the most critical information, using techniques like Principal Component Analysis (PCA).

Student ID | Composite Score
001        | 2.0
002        | -0.5
003        | -1.5

• Numerosity reduction: replacing detailed data that the analysis does not need (individual student records) with summary information, in this case the total enrollment in different departments.

Department  | Enrollment
Mathematics | 150
Biology     | 120
History     | 100

• Data compression: reducing the size of the data representation, possibly by storing only essential information (e.g., student ID, name, GPA) rather than the full dataset.

Student ID | Name    | Address   | Phone number | Nationality | GPA
001        | Alice   | ********* | *********    | *********   | 3.75
002        | Bob     | ********* | *********    | *********   |
003        | Charlie | ********* | *********    | *********   | 2.5

Student ID | Name    | GPA
001        | Alice   | 3.75
002        | Bob     |
003        | Charlie | 2.5

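As a rough sketch of the numerosity reduction and compression ideas above, the snippet below collapses individual records into per-department counts and keeps only the essential columns. The department assignments and field names are hypothetical, chosen just for illustration; the enrollment figures on the slide cannot be derived from three students.

```python
from collections import Counter

# Hypothetical detailed records (departments are illustrative assumptions).
records = [
    {"student_id": "001", "name": "Alice", "department": "Mathematics",
     "address": "hidden", "gpa": 3.75},
    {"student_id": "002", "name": "Bob", "department": "Biology",
     "address": "hidden", "gpa": None},
    {"student_id": "003", "name": "Charlie", "department": "Mathematics",
     "address": "hidden", "gpa": 2.5},
]

# Numerosity reduction: replace individual rows with a summary count.
enrollment = Counter(r["department"] for r in records)

# Compression by projection: keep only the columns the analysis needs.
essential = ("student_id", "name", "gpa")
compact = [{key: r[key] for key in essential} for r in records]

print(dict(enrollment))
print(compact)
```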
Major Tasks in Data Pre-processing
• Data transformation and data discretization
Data transformation involves converting the original data into a different format, typically to make it more suitable for analysis or to meet specific requirements, for example by creating new variables, normalizing, or aggregating (average, sum, ...). Data discretization involves converting continuous data (like ages) into discrete categories.

Student ID | Age | Math Score | English Score
001        | 20  | 90         | 85
002        | 21  | 92         | 78
003        | 19  | 85         | 88

• Normalization

Student ID | Math Score | Total Questions | Normalized Math Score
001        | 90         | 100             | 0.9
002        | 78         | 90              | 0.866
003        | 85         | 95              | 0.894

Student ID | Age | Math Score | English Score | Performance Level
001        | 20  | 90         | 85            | High
002        | 21  | 78         | 92            | Medium
003        | 19  | 85         | 88            | High

This transformation adds a categorical variable that simplifies the interpretation of performance.

Discretized Dataset:

Student ID | Age Group | Math Score | English Score | Performance Level
001        | 20-24     | 90         | 85            | High
002        | 20-24     | 78         | 92            | Medium
003        | 15-19     | 85         | 88            | High

• Concept hierarchy generation

Course ID | Course Name         | Category
101       | Computer skills     | CS
102       | Presentation Skills | GER

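The normalization and discretization steps on this slide can be reproduced with a short sketch. The field names and the five-year bin width are assumptions made for the example; the slide only shows the resulting groups 15-19 and 20-24.

```python
# Math scores and question totals as listed in the normalization table.
rows = [
    {"student_id": "001", "age": 20, "math_score": 90, "total_questions": 100},
    {"student_id": "002", "age": 21, "math_score": 78, "total_questions": 90},
    {"student_id": "003", "age": 19, "math_score": 85, "total_questions": 95},
]


def age_group(age, width=5):
    """Map an age to a fixed-width bin label such as '20-24'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


for row in rows:
    # Normalization: express the score as a fraction of the maximum possible.
    row["normalized_math"] = row["math_score"] / row["total_questions"]
    # Discretization: replace the continuous age with a categorical group.
    row["age_group"] = age_group(row["age"])

for row in rows:
    print(row["student_id"], round(row["normalized_math"], 3), row["age_group"])
```

Dividing by the number of questions puts the three scores on a common 0 to 1 scale, so students who sat tests of different lengths can be compared.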
Outline
• Data Pre-processing: An Overview
• Data Quality
• Major Tasks in Data Pre-processing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary

Data Cleaning

• Data in the real world is dirty: lots of potentially incorrect data, e.g., from faulty instruments, human or computer error, or transmission error
• incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
• noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
• inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?

Incomplete (Missing) Data

• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
• equipment malfunction
• data that was inconsistent with other recorded data and was therefore deleted
• data not entered due to misunderstanding
• certain data not being considered important at the time of entry
• history or changes of the data not being registered
• Missing data may need to be inferred

Incomplete (Missing) Data

• Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious and often infeasible
• Fill it in automatically with
• a global constant: e.g., “unknown” (which may act as a new class)
• the attribute mean
• the attribute mean for all samples belonging to the same class: smarter
• the most probable value: inference-based, such as a Bayesian formula or a decision tree

How to Handle Missing Data?
Methods other than re-collecting the data

Handling missing data is a critical aspect of data preprocessing to ensure accurate and reliable analyses. Here are several strategies to deal with missing data:
1. Data Imputation: mean/median/mode imputation, regression imputation, or (smarter) the attribute mean computed over all samples belonging to the same class.
2. Deletion Strategies: listwise deletion (dropping rows) or column deletion.
3. Interpolation: for time-series data, missing values can be estimated from the values before and after the missing points.
4. Prediction Models: train machine learning models to estimate missing values from the relationships observed in the rest of the data, using, for example, a Bayesian formula or a decision tree.
5. Multiple Imputation: create multiple imputed datasets and perform the analysis on each, then combine the results to account for the uncertainty introduced by imputation.
6. Using Domain Knowledge: seek input from domain experts; this is costly and often infeasible.
7. Handling for Categorical Data: create a separate category for missing values, or replace them with the most probable value.
8. Utilizing Software Functions: many programming languages and statistical software packages have built-in functions for handling missing data.
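Three of the strategies listed above (global mean imputation, class-conditional mean imputation, and time-series interpolation) can be sketched as follows. The customer data, the `segment` label, and the function names are hypothetical, and the sketch assumes every class has at least one observed value.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample: income is missing (None) for two customers.
customers = [
    {"segment": "retail", "income": 30.0},
    {"segment": "retail", "income": None},
    {"segment": "retail", "income": 34.0},
    {"segment": "corporate", "income": 80.0},
    {"segment": "corporate", "income": None},
    {"segment": "corporate", "income": 90.0},
]


def impute_global_mean(rows, field):
    """Replace missing values with the mean of all observed values."""
    fill = mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


def impute_class_mean(rows, field, label):
    """Replace missing values with the mean observed within the same class."""
    by_class = defaultdict(list)
    for r in rows:
        if r[field] is not None:
            by_class[r[label]].append(r[field])
    fills = {cls: mean(vals) for cls, vals in by_class.items()}
    return [{**r, field: fills[r[label]] if r[field] is None else r[field]}
            for r in rows]


def interpolate_linear(series):
    """Fill interior None gaps in an evenly spaced series by linear interpolation."""
    out = list(series)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    return out


global_filled = impute_global_mean(customers, "income")
class_filled = impute_class_mean(customers, "income", "segment")
series_filled = interpolate_linear([10, None, None, 16])
print(global_filled[1]["income"], class_filled[1]["income"], series_filled)
```

The global mean assigns the same value to both customers, while the class mean respects the gap between segments, which is why the slide calls it the smarter option.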
Noisy Data

• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
• Other data problems which require data cleaning
• duplicate records
• incomplete data
• inconsistent data

Age | Gender | Height
25  | Male   | 172.72
32  | Female | 157.48
45  | Male   | 180.34
28  | Female | 165.1
37  | Male   | 175.26
22  | Female | 152.4
29  | Male   | 182.88
40  | Female | 160.02
33  | Male   | 177.8
26  | Female | 162.56
38  | Male   | 172.72
30  | Female | 154.94
42  | Male   | 185.42
27  | Female | 167.64
35  | Male   | 170.18
24  | Female | 149.86
36  | Male   | 177.8
31  | Female | 157.48
44  | Male   | 180.34
29  | Female | 165.1
39  | Male   | 175.26
23  | Female | 152.4
32  | Male   | 182.88
41  | Female | 160.02
34  | Male   | 177.8
25  | Female | 162.56
37  | Male   | 172.72
28  | Female | 154.94
43  | Male   | 185.42
26  | Female | 167.64
35  | Male   | 170.18
24  | Female | 149.86
36  | Male   | 177.8
31  | Female | 157.48
45  | Male   | 180.34
29  | Female | 165.1
38  | Male   | 175.26
22  | Female | 152.4
30  | Male   | 182.88
41  | Female | 160.02
33  | Male   | 177.8
27  | Female | 167.64
39  | Male   | 175.26
23  | Female | 152.4
32  | Male   | 182.88
43  | Female | 160.02
34  | Male   | 177.8
25  | Female | 162.56
37  | Male   | 172.72
28  | Female | 154.94
150 |        | 149.86

Noisy Data
• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human (e.g., deal with possible outliers)
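The binning procedure above can be sketched as follows: sort, split into equal-frequency bins, then smooth each bin by its mean or by its boundaries. The nine sample values are illustrative and do not come from the lecture's table, and the sketch assumes the number of values is divisible by the number of bins.

```python
from statistics import mean


def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    ordered = sorted(values)
    size = len(ordered) // n_bins  # assumes len(values) is divisible by n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max (ties go to min)."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]


# Illustrative values (an assumption, not data from the slides).
values = [15, 4, 21, 8, 24, 21, 34, 25, 28]
bins = equal_frequency_bins(values, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)

print(bins)
print(by_means)
print(by_boundaries)
```

Smoothing by means pulls an isolated extreme value toward its neighbours, while smoothing by boundaries keeps the bin's minimum and maximum unchanged.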

[Figure: smoothing by bin means for two different features, age and height, which noticeably reduced the error caused by the incorrect or noisy values.]

Data Cleaning as a Process

• Data discrepancy detection
• Use metadata (e.g., domain, range, dependency, distribution)
• Check field overloading
• Check uniqueness rule, consecutive rule and null rule
• Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
• Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
• Data migration tools: allow transformations to be specified
• ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
• Integration of the two processes
• Iterative and interactive
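A minimal sketch of discrepancy detection with the null rule, the uniqueness rule, and a range check from metadata is shown below. It reuses four rows of the employee table from the start of the lecture; the function names and the plausible age range of 0 to 120 are assumptions made for this example.

```python
from collections import Counter

# Four rows of the employee table shown at the start of the lecture.
employees = [
    {"employee_id": 101, "name": "John Doe", "age": 28, "salary": "$60,000"},
    {"employee_id": 102, "name": "Jane Smith", "age": 35, "salary": None},
    {"employee_id": 104, "name": "Emily White", "age": -10, "salary": "$80,000"},
    {"employee_id": 101, "name": "Goerge Marco", "age": 28, "salary": "$6000,000"},
]


def null_violations(rows, field):
    """Null rule: return the indexes of rows where the field is missing."""
    return [i for i, r in enumerate(rows) if r[field] is None]


def uniqueness_violations(rows, field):
    """Uniqueness rule: return the values that occur more than once."""
    counts = Counter(r[field] for r in rows)
    return sorted(value for value, count in counts.items() if count > 1)


def range_violations(rows, field, low, high):
    """Range metadata: return the indexes of rows outside [low, high]."""
    return [i for i, r in enumerate(rows)
            if r[field] is not None and not low <= r[field] <= high]


missing_salary = null_violations(employees, "salary")
duplicate_ids = uniqueness_violations(employees, "employee_id")
bad_ages = range_violations(employees, "age", 0, 120)  # assumed plausible range
print(missing_salary, duplicate_ids, bad_ages)
```

Checks like these only flag candidates; deciding which of two duplicate records is correct still needs human inspection or domain knowledge.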

Data Integration Challenges

• Data integration:
• Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
• Integrate metadata from different sources
• Entity identification problem:
• Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
• For the same real-world entity, attribute values from different sources are different
• Possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundancy in Data Integration
• Redundant data often occur when multiple databases are integrated
• Object identification: The same attribute or object may have different names in different databases
• Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant attributes can often be detected by correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality

Correlation Analysis (Nominal Data)
• χ² (chi-square) test

$$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad \text{where } E_{ij} = \frac{O_{i\cdot}\, O_{\cdot j}}{S}$$

• The larger the χ² value, the more likely the attributes are related
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
• The χ² statistic tests the hypothesis that X and Y are independent, that is, there is no correlation between them
• The test is based on a significance level, with degrees of freedom
• If the hypothesis can be rejected, then we say that X and Y are statistically correlated

Correlation Analysis (Nominal Data)

• A contingency table records the frequencies of the value combinations of two attributes.
• Let $O_{ij}$ denote the observed count of the joint event that attribute X takes on value $X_i$ and attribute Y takes on value $Y_j$.
• Each and every possible $O_{ij}$ has its own cell (or slot) in the table.
• Also, we have $O_{i\cdot} = \sum_j O_{ij}$ and $O_{\cdot j} = \sum_i O_{ij}$.
• The total number of data tuples is $S = \sum_{i,j} O_{ij}$.

Chi-Square Calculation: An Example

                         | Play chess | Not play chess | Sum (row)
Like science fiction     | 250 (90)   | 200 (360)      | 450
Not like science fiction | 50 (210)   | 1000 (840)     | 1050
Sum (col.)               | 300        | 1200           | 1500

• χ² calculation (numbers in parentheses are expected counts)

$$\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93$$

• The degrees of freedom are (2-1)(2-1) = 1
• The χ² value needed to reject the hypothesis at the 0.001 significance level is 10.83
• It shows that like_science_fiction and play_chess are correlated in the group
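The calculation above can be checked with a short sketch that derives the expected counts from the row and column sums and accumulates the χ² statistic. The function name `chi_square` is an assumption for this example. The unrounded sum is about 507.94, which the slide reports as 507.93.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count under independence: E_ij = O_i. * O_.j / S
            expected = row_sums[i] * col_sums[j] / total
            statistic += (count - expected) ** 2 / expected

    dof = (len(row_sums) - 1) * (len(col_sums) - 1)
    return statistic, dof


# Rows: like / not like science fiction; columns: play / not play chess.
observed = [[250, 200], [50, 1000]]
statistic, dof = chi_square(observed)
critical_value = 10.83  # threshold quoted on the slide for the 0.001 level, 1 dof

print(round(statistic, 2), dof, statistic > critical_value)
```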

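Correlation Analysis (Nominal Data)

• To be sure that the χ² result reflects a real statistically significant difference, the p-value should be looked up and compared with the chosen significance level.
• For a contingency table, the number of degrees of freedom of the chi-square test is the number of categories of the first attribute minus one, multiplied by the number of categories of the second attribute minus one.
• That is, df = (r − 1) × (c − 1), where r and c are the numbers of rows and columns of the table.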

Correlation Analysis (Nominal Data)

• Since our computed value (507.93) is above the critical value (10.83), we can reject the hypothesis that play_chess and like_science_fiction are independent.
• Also, we conclude that the two attributes are (strongly) correlated for the given group of people.

• Correlation does not imply causality
• The number of hospitals and the number of car thefts in a city are correlated
• Both are causally linked to a third variable: population

Correlation Analysis (Numeric Data)

• Correlation coefficient (also called Pearson’s product moment coefficient)

$$r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} (a_i b_i) - n\,\bar{A}\bar{B}}{n\,\sigma_A \sigma_B}$$

• where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum (a_i b_i)$ is the sum of the AB cross-product.
• If $r_{A,B} > 0$, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
• $r_{A,B} = 0$: no linear correlation; $r_{A,B} < 0$: negatively correlated
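As a small sketch of the coefficient above, the function below computes Pearson's r from the cross-product and the sums of squared deviations. Written this way, the count n cancels out, so the result does not depend on whether population or sample standard deviations are used. The sample values are illustrative assumptions.

```python
from math import sqrt


def pearson_r(a, b):
    """Pearson product-moment correlation of two equally long numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cross / sqrt(ss_a * ss_b)


# Illustrative data: b rises with a, c falls as a rises.
a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]
c = [10, 8, 6, 4, 2]

r_ab = pearson_r(a, b)  # perfectly positively correlated
r_ac = pearson_r(a, c)  # perfectly negatively correlated
print(r_ab, r_ac)
```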

Visually Evaluating Correlation

[Figure: scatter plots showing the similarity from –1 to 1.]

Correlation (viewed as linear relationship)

• Correlation measures the linear relationship between objects
• To compute correlation, we standardize data objects, A and B, and then take their dot product (divided by the number of values n, so that the result lies between –1 and 1)

$$a'_k = \frac{a_k - \mathrm{mean}(A)}{\mathrm{std}(A)}$$

$$b'_k = \frac{b_k - \mathrm{mean}(B)}{\mathrm{std}(B)}$$

$$\mathrm{correlation}(A, B) = \frac{A' \cdot B'}{n}$$
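The standardize-then-dot-product view can be sketched as below, using the two stock series from the lecture's covariance example. One assumption is made explicit: the raw dot product of the standardized vectors equals n times the correlation when the population standard deviation is used, so the sketch divides by n.

```python
from statistics import mean, pstdev


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def correlation(a, b):
    """Dot product of the standardized vectors, divided by n."""
    a_std, b_std = standardize(a), standardize(b)
    return sum(x * y for x, y in zip(a_std, b_std)) / len(a)


a = [2, 3, 5, 4, 6]
b = [5, 8, 10, 11, 14]
r = correlation(a, b)
print(round(r, 4))
```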
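
Covariance (Numeric Data)

• Covariance is similar to correlation

$$Cov(A,B) = E\big((A - \bar{A})(B - \bar{B})\big) = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n}$$

• Correlation coefficient:

$$r_{A,B} = \frac{Cov(A,B)}{\sigma_A \sigma_B}$$

• where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means or expected values of A and B, and $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B.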

Covariance (Numeric Data)

• Positive covariance: If $Cov(A,B) > 0$, then A and B both tend to be larger than their expected values.
• Negative covariance: If $Cov(A,B) < 0$, then if A is larger than its expected value, B is likely to be smaller than its expected value.
• Independence: if A and B are independent, then $Cov(A,B) = 0$, but the converse is not true:
• Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions does a covariance of 0 imply independence

Covariance (Numeric Data)

• It can be simplified in computation as

$$Cov(A,B) = E(A \cdot B) - \bar{A}\bar{B}$$

• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?
• E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
• Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 4
• Thus, A and B rise together since Cov(A, B) > 0.
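The stock example can be verified with a brief sketch that computes the covariance both from the definition (mean product of deviations) and from the shortcut used in the calculation above. The variable names are chosen for this illustration.

```python
# Weekly values of stocks A and B from the example.
a = [2, 3, 5, 4, 6]
b = [5, 8, 10, 11, 14]
n = len(a)

mean_a = sum(a) / n  # E(A) = 4
mean_b = sum(b) / n  # E(B) = 9.6

# Definition: average product of the deviations from the means.
cov_definition = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

# Shortcut: E(A * B) - E(A) * E(B).
cov_shortcut = sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b

print(mean_a, mean_b, round(cov_definition, 6), round(cov_shortcut, 6))
```

Both routes give a covariance of 4 (up to floating-point rounding), matching the hand calculation.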

Data Fusion
• In the big data era, a massive amount of heterogeneous data is generated from different sources in different domains, such as social media, transportation, health care, and wireless communication networks.
• These data vary in modality, representation, and distribution.
• Integrating diverse data with different modalities to develop a more usable form of information is called multimodal data fusion.
• Multimodal data fusion approaches are classified into four categories according to the level of fusion:
• Raw data fusion (early fusion).
• Decision fusion (late fusion).
• Hybrid fusion.
• Features fusion.

Data Fusion
Raw data fusion
• A straightforward approach: the diverse data modalities are simply concatenated and used as the input to a machine learning algorithm.

Decision fusion
• An approach that performs the fusion after taking a separate decision on each modality.
• It may use the same predictive model for all modalities or a different predictive model for each modality.
• Several decision fusion techniques may be used, such as voting schemes, signal variance, averaging, and weighting based on channel noise.

Data Fusion
Hybrid data fusion

• A combination of the early fusion and late fusion approaches.
• It performs fusion on the decisions using one of the decision fusion techniques, but the input of each predictive model is a concatenation of diverse modalities, and the predictive models must be different.

Features fusion

• An approach that exploits linear and nonlinear correlations among the feature modalities to generate a single unified representation of those modalities.
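To make the difference between early and late fusion concrete, the sketch below concatenates per-sample feature vectors (raw data fusion) and combines per-modality decisions by majority vote or weighted averaging (decision fusion). The modality names, feature values, labels, and weights are all hypothetical, and no actual predictive model is trained here.

```python
from collections import Counter


def early_fusion(*modalities):
    """Raw data fusion: concatenate the feature vectors of each sample."""
    return [sum((list(features) for features in sample), [])
            for sample in zip(*modalities)]


def majority_vote(decisions):
    """Decision fusion: pick the label chosen by most per-modality models."""
    return Counter(decisions).most_common(1)[0][0]


def weighted_average(scores, weights):
    """Decision fusion: average per-modality scores, weighted by reliability."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Hypothetical features for two samples from two modalities.
audio = [[0.1, 0.2], [0.3, 0.4]]
video = [[1.0], [2.0]]
fused = early_fusion(audio, video)

# Hypothetical outputs of three per-modality models for one sample.
label = majority_vote(["cat", "dog", "cat"])
score = weighted_average([0.9, 0.5], [3, 1])

print(fused, label, round(score, 3))
```

A hybrid scheme would feed concatenated inputs such as `fused` to several different models and then combine their outputs with one of the decision fusion functions.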
