CSCI322: Data Analysis
Lecture 2: Data Preprocessing I
Dr. Noha Gamal, Dr. Mustafa Elattar, Dr. Mohamed Nagy
Outline
• Data Quality
• Major Tasks in Data Pre-processing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Fusion
• Data Transformation and Data Discretization
Example: a dirty data table (column labels reconstructed; the label of the last column is not recoverable from the slide):

#  ID   Name            Age  Salary     Height  Date      (unlabeled)
1  101  Goerge Marco    28   $6000,000  1.82    Oct-2023  x
2  102  Jane Smith      35              1.63    Oct-2023  0.25
3  103  Robert Johnson  45   $90,000    1.82    Jan-2019  15.68
4  104  Emily White     -10  $80,000    1.82    Oct-2023  188.78
5  105  Michael Brown   40   $100,000   1.82    Oct-2023  0
The same table illustrates typical data-quality problems:
• Goerge Marco (row 1): "Goerge" is INACCURATE (misspelled), the salary format "$6000,000" is INCONSISTENT, the value "x" is UN-INTERPRETABLE, and there is a discrepancy between duplicate records.
• Robert Johnson (row 3): the Jan-2019 date is UNTIMELY (not updated).
• Emily White (row 4): Age = -10 is NOISY/WRONG.
• Michael Brown (row 5): the recorded value "A" is UN-BELIEVABLE.
Data Quality: Why Preprocess the Data?
• Measures for data quality: a multidimensional view
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, dangling, …
• Timeliness: is the data updated in a timely way?
• Believability: how much can the data be trusted to be correct?
• Interpretability: how easily can the data be understood?
How can we enhance data quality, given that we don't have the luxury of dropping every problematic record?
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
Major Tasks in Data Pre-processing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Example: In a students database, some entries in the "Attendance" column are missing. Cleaning involves deciding whether to fill in those missing values or remove the corresponding rows, ensuring accurate records of student attendance.

Student ID  Name     Age  Attendance
001         Alice    20   90%
002         Bob      21
003         Charlie  19   85%

• Data integration
• Integration of multiple databases, data cubes, or files
• Example: Combining information from the student database with data from the library system to create a unified view of student profiles, including academic records and library borrowing history.

Student ID  Name     Age  Attendance  Books_Borrowed
001         Alice    20   90%         5
002         Bob      21               3
003         Charlie  19   85%         2
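The cleaning and integration steps above can be sketched with pandas. The table contents come from the slide; the column names and the choice of mean imputation are assumptions for illustration.

```python
# Sketch of data cleaning (fill a missing value) and data integration
# (join two sources), assuming the slide's student and library tables.
import pandas as pd

students = pd.DataFrame({
    "StudentID": ["001", "002", "003"],
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [20, 21, 19],
    "Attendance": [90.0, None, 85.0],   # Bob's attendance is missing
})

# Cleaning: fill the missing attendance with the column mean
students["Attendance"] = students["Attendance"].fillna(students["Attendance"].mean())

# Integration: join with a (hypothetical) library-system table
library = pd.DataFrame({
    "StudentID": ["001", "002", "003"],
    "Books_Borrowed": [5, 3, 2],
})
profiles = students.merge(library, on="StudentID", how="left")
print(profiles)
```

One could equally drop Bob's row instead of imputing; the slide's point is that this is a deliberate decision, not an automatic one.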
Major Tasks in Data Pre-processing
• Data reduction
• Example: Consider a dataset with scores in multiple subjects for each student. Using dimensionality-reduction techniques like PCA, you can represent the essential information with fewer dimensions, for instance transforming scores in Math, English, Physics, and Chemistry into a single composite score.

Student ID  Math Score  English Score  Physics Score  Chemistry Score
001         90          85             88             92
002         78          92             85             80
003         85          88             90             78

Student ID  Composite Score
001          2.0
002         -0.5
003         -1.5

• Dimensionality reduction: reducing the number of dimensions (scores in different subjects) while preserving the most critical information, using techniques like Principal Component Analysis (PCA).
• Numerosity reduction: reducing detailed data that is not needed in the analysis (individual student records) to summary information, in this case the total enrollment in different departments.

Department   Enrollment
Mathematics  150
Biology      120
History      100

• Data compression: reducing the size of the data representation, possibly by storing only essential information (e.g., student ID, name, GPA) rather than the full dataset.
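A minimal sketch of the PCA step, using the slide's score matrix. Note that the sign and scale of a principal component are arbitrary, so the resulting composite scores need not match the slide's 2.0 / -0.5 / -1.5 exactly.

```python
# Project the 4 subject scores onto the first principal component,
# giving one composite score per student (scores from the slide).
import numpy as np

scores = np.array([
    [90, 85, 88, 92],   # student 001: Math, English, Physics, Chemistry
    [78, 92, 85, 80],   # student 002
    [85, 88, 90, 78],   # student 003
], dtype=float)

centered = scores - scores.mean(axis=0)   # center each subject column
cov = np.cov(centered, rowvar=False)      # 4x4 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
first_pc = eigvecs[:, -1]                 # direction of maximum variance
composite = centered @ first_pc           # one composite score per student
print(composite)
```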
Data Cleaning
• Data in the Real World Is Dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission errors
• incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
• noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
• inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
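Inconsistencies like the Age/Birthday example above can often be caught with a simple sanity check. The record values are from the slide; the field names and date layout are assumptions.

```python
# Flag a record whose stored Age disagrees with its Birthday
# (Age=42 vs. Birthday 03/07/2010, as on the slide).
from datetime import date

record = {"Age": 42, "Birthday": "03/07/2010"}    # dd/mm/yyyy assumed

birth_year = int(record["Birthday"].split("/")[-1])
implied_age = date.today().year - birth_year
if abs(implied_age - record["Age"]) > 1:          # allow off-by-one around birthdays
    print("inconsistent record:", record)
```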
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with
• a global constant: e.g., “unknown”, a new class?!
• the attribute mean
• the attribute mean for all samples belonging to the same class: smarter
• the most probable value: inference-based, such as a Bayesian formula or decision tree
Handling missing data is a critical aspect of data preprocessing to ensure accurate and reliable analyses. Here are several strategies to deal with missing data:
1. Data imputation: mean/median/mode imputation, regression imputation, the attribute mean, or the attribute mean for all samples belonging to the same class (smarter).
2. Deletion strategies: listwise deletion (dropping rows), column deletion.
3. Interpolation: for time-series data, missing values can be estimated based on the values before and after the missing points.
4. Prediction models: train machine-learning models to estimate missing values based on the relationships observed in the rest of the data, for example using a Bayesian formula or a decision tree.
5. Multiple imputation: create multiple imputed datasets and perform analyses on each; combine results to account for the uncertainty introduced by imputation.
6. Using domain knowledge: seek input from domain experts; this can be costly and infeasible.
7. Handling categorical data: create a separate category for missing values, or replace them using the most probable value.
8. Utilizing software functions: many programming languages and statistical software packages have built-in functions for handling missing data.
Data Integration Challenges
• Data integration:
• Combines data from multiple sources into a coherent storage
• Schema integration: e.g., A.cust-id ≡ B.cust-#
• Integrate metadata from different sources
• Entity identification problem:
• Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
• Detecting and resolving data value conflicts
• For the same real world entity, attribute values from different
sources are different
• Possible reasons: different representations, different scales, e.g.,
metric vs. British units
Handling Redundancy in Data
Integration
• Redundant data often occur when integrating multiple databases
• Object identification: The same attribute or object may have
different names in different databases
• Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue
• Redundant attributes may be able to be detected by correlation
analysis and covariance analysis
• Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining
speed and quality
• The larger the χ² value, the more likely the attributes are related
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
• The χ² statistic tests the hypothesis that X and Y are independent, that is, that there is no correlation between them
• The test is based on a significance level, with degrees of freedom
• If the hypothesis can be rejected, then we say that X and Y are statistically correlated
• Worked example (numbers in parentheses in the original contingency table are expected counts):

χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
• The degrees of freedom are (2−1)(2−1) = 1
• The χ² value needed to reject the hypothesis at the 0.001 significance level is 10.83
• Since 507.93 > 10.83, like_science_fiction and play_chess are correlated in the group
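The χ² value above can be checked directly from the observed and expected counts given in the worked example:

```python
# Recompute the chi-square statistic from the example's
# observed and expected counts.
observed = [250, 50, 200, 1000]
expected = [90, 210, 360, 840]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi2)           # about 507.9, matching the slide's value
print(chi2 > 10.83)   # far exceeds the 0.001 critical value for 1 df
```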
• Covariance example:
• Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
• E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
• Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
• Thus, A and B rise together since Cov(A, B) > 0.
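The covariance computation above can be verified with NumPy, using the population covariance (dividing by n) to match the formula Cov(A,B) = E(AB) − E(A)E(B):

```python
# Covariance of the two stock-price series from the example.
import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)
B = np.array([5, 8, 10, 11, 14], dtype=float)

cov_AB = np.cov(A, B, bias=True)[0, 1]   # bias=True -> divide by n, not n-1
print(cov_AB)                            # ~4.0 > 0, so A and B rise together
```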
Decision fusion
• An approach that performs the fusion after taking a separate decision on each modality.
• In this approach, the same predictive model can be used for all modalities, or a different predictive model for each modality.
• Several decision fusion techniques may be used, such as voting schemes, signal variance, averaging, and weighting based on channel noise.
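A minimal sketch of one decision-fusion technique from the list above, majority voting. The three "modality" predictions (audio, video, text) and their labels are hypothetical stand-ins for the outputs of per-modality classifiers.

```python
# Decision fusion by majority vote over per-modality predictions.
from collections import Counter

def fuse_by_vote(decisions):
    """Return the label predicted by the most modalities."""
    return Counter(decisions).most_common(1)[0][0]

# Hypothetical per-modality decisions for one sample
audio_pred, video_pred, text_pred = "happy", "happy", "neutral"
print(fuse_by_vote([audio_pred, video_pred, text_pred]))  # -> happy
```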
Features fusion
• Performs fusion on the decision using one of the decision fusion techniques, but the input of each predictive model is a concatenation of diverse modalities, and the predictive models must be different.