0% found this document useful (0 votes)

71 views38 pages

CSCI322 - Lecture 2

The document outlines the importance of data preprocessing in data analysis, detailing various tasks such as data cleaning, integration, reduction, and transformation. It emphasizes the need for data quality measures, including accuracy, completeness, consistency, and timeliness, to ensure reliable analysis. Additionally, it discusses methods for handling missing and noisy data, highlighting strategies like data imputation and the integration of multiple data sources.

Uploaded by

Noha Gamal Eldin

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

71 views38 pages

CSCI322 - Lecture 2

Uploaded by

Noha Gamal Eldin

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 38

CSCI322: Data

CSCI322: Data
Analysis
Analysis
Lecture 1: Data Types,
Lecture 2: Data Preprocessing I
Collection, Sampling, and
Preprocessing
Dr. Noha Gamal, Dr. Mustafa
Elattar, Dr. Mohamed Nagy
Dr. Mustafa Elattar

CSCI322: Data Analysis

Data Preprocessing

• Data Quality
• Major Tasks in Data Pre-processing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Fusion
• Data Transformation and Data Discretization

CSCI322: Data Analysis

Record Employe Last
ID eID Name Age Salary Height updated SVC

1 101 John Doe 28 $60,000 1.75 Oct-2023 100.253

Jane
2 102 35 1.63 Oct-2023 0.25
Smith

Robert
3 103 45 $90,000 1.82 Jan-2019 15.68
Johnson

Emily Oct-2023
4 104 -10 $80,000 1.82 188.78
White

Michael Oct-2023
5 105 40 $100,000 1.82 0
Brown

Goerge $6000,00
1 101 28 1.82 Oct-2023 x
Marco 0

3
CSCI322: Data Analysis
Un-Interpretable

Record Employe Last

ID eID Name Age Salary Height updated SVC

1 101 John Doe 28 $60,000 1.75 Oct-2023 100.253

Jane UNAVAILABLE 1.63

2 102 35 Oct-2023 0.25
Smith

Robert
3 103 45 $90,000 1.82 Jan-2019 15.68
Johnson
UNTIMELY
Emily Oct-2023
4 104 -10 $80,000 1.82 188.78
White NOISY/WRONG
UN- Beliviable

Michael Oct-2023
5 105 40 $100,000 1.82 A
Brown
Discrepancy
between Goerge $6000,00 INCONSISTENT
1 101 28 1.82 Oct-2023 X
duplicate records Marco 0
INACCURATE

4
CSCI322: Data Analysis
Data Quality: Why Preprocess the
Data?
• Measures for data quality: A multidimensional view
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, dangling, …
• Timeliness: timely update?
• Believability: how trustable the data are correct?
• Interpretability: how easily the data can be understood?
How to enhance the data quality?, taking into consideration
that, we don’t have the luxury to drop each problematic
record !!
CSCI322: Data Analysis
Major Tasks in Data Preprocessing

• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation

CSCI322: Data Analysis

Data Quality: What is Data Pre-
processing?
• Data preprocessing is essentially the process of cleaning,
integrating, transforming, and organizing raw data into a format
suitable for analysis.
• It acts as the gateway between the raw, often messy, data we collect and the
meaningful insights we seek to extract.

• Raw data is the unprocessed information collected directly from

observations, surveys, or experiments. It's the starting point of our
analysis, often messy and unstructured. Characteristics of raw data
include its originality, lack of organization, and potential presence of
errors or inconsistencies.

7
CSCI322: Data Analysis
Major Tasks in Data Pre-processing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation

8
CSCI322: Data Analysis
Major Tasks in Data Pre-processing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies Student Attenda
ID Name Age nce
• Example: In a students database, some entries in the 001 Alice 20 90%
"Attendance" column are missing. Cleaning involves 002 Bob 21
deciding whether to fill in those missing values or remove 003 Charlie 19 85%
the corresponding rows, ensuring accurate records of
student attendance.
• Data integration
Books_
• Integration of multiple databases, data cubes, or files Student Attenda Borrow
ID Name Age nce ed
• Example: Combining information from the student 001 Alice 20 90% 5
database with data from the library system to create a
002 Bob 21 3
unified view of student profiles, including academic
records and library borrowing history. 003 Charlie 19 85% 2

9
CSCI322: Data Analysis
Major Tasks in Data Pre-processing
• Data reduction Student Math English Physics Chemist
Example: Consider a dataset with scores in multiple subjects for ID Score Score Score ry Score
each student. Using dimensionality reduction techniques like PCA, 001 90 85 88 92
you can represent the essential information with fewer dimensions. 002 78 92 85 80
For instance, transforming scores in Math, English, Physics, and
003 85 88 90 78
Chemistry into a single composite score.
• Dimensionality reduction (Reducing the number of
dimensions (scores in different subjects) while preserving the Student ID Composite Score
most critical information using techniques like Principal 001 2.0
Component Analysis (PCA).) 002 -0.5
003 -1.5
• Numerosity reduction (Reducing detailed data, if not needed
in the analysis (individual student records) to summary
Department Enrollment
information, in this case, the total enrollment in different
Mathematics 150
departments.)
Biology 120
• Data compression (Reducing the size of the data History 100
representation, possibly by storing only essential information
(e.g., student ID, name, GPA) rather than the full dataset.)

Student Addres Phone nationa Student

ID Name s number lity GPA ID Name GPA
001 Alice ********* ********* ********* 3.75
001 Alice 3.75
002 Bob ********* ********* *********
002 Bob
003 Charlie ********* ********* ********* 2.5
003 Charlie 2.5
10
CSCI322: Data Analysis
Engli
Major Tasks in Data Pre-processing
Stud Math sh
ent Scor Scor
• Data transformation and data discretization ID Age e e
Data transformation involves converting the original data into a different format, typically to 001 20 90 85
make it more suitable for analysis or to meet specific requirements, such as, creating new
variables, Normalization, or aggregating (average, or sum,..). Data discretization involves 002 21 92 78
converting continuous data (like ages) into discrete categories
Perf
003 19 85 88
• Normalization Engli orm
Stud Math Total Stud Math sh ance
• Stud
ent Scor Ques ent Scor Scor Leve
ent Normalized ID Age e e l
ID e tions ID Math Score
001 90 100 001 20 90 85 High
001 0.9
002 78 90 Medi
002 0.866 002 21 78 92
This transformation adds a categorical
um va
003 85 95 003 0.894 that simplifies the interpretation of
003 19Discretized
85Dataset:88 High
• Concept hierarchy generation performance
Perfor
Studen Age Math English mance
Course ID Course Name Category t ID Group Score Score Level

Computer 001 20-24 90 85 High

101 CS
skills 002 20-24 78 92
Mediu
m
Presentation 003 15-19 85 88 High11
102
CSCI322: Data Analysis GER
Skills
Outline
• Data Pre-processing: An Overview
• Data Quality
• Major Tasks in Data Pre-processing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary

12
CSCI322: Data Analysis
Data Cleaning

• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
• incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
• noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
• inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?

CSCI322: Data Analysis

Incomplete (Missing) Data

• Data is not always available

• E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of
entry
• not register history or changes of the data
• Missing data may need to be inferred

CSCI322: Data Analysis

Incomplete (Missing) Data

• Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the percentage of missing values per
attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
• a global constant : e.g., “unknown”, a new class?!
• the attribute mean
• the attribute mean for all samples belonging to the same class:
smarter
• the most probable value: inference-based such as Bayesian formula
or decision tree

CSCI322: Data Analysis

How to Handle Missing Data?
Methods other than Re-collecting the data

Handling missing data is a critical aspect of data preprocessing to ensure accurate and
reliable analyses. Here are several strategies to deal with missing data:
1. Data Imputation: Mean/Median/Mode Imputation, Regression Imputation, the
attribute mean, the attribute mean for all samples belonging to the same class:
smarter!.
2. Deletion Strategies: Listwise Deletion/Dropping Rows, Column Deletion
3. Interpolation: Time Series Interpolation: For time-series data, missing values can
be estimated based on the values before and after the missing points.
4. Prediction Models: Machine Learning Models: Train predictive models to estimate
missing values based on the relationships observed in the rest of the data using
Bayesian formula or decision tree for example.
5. Multiple Imputation: Generate Multiple Imputations: Create multiple imputed
datasets and perform analyses on each. Combine results to account for uncertainty
introduced by imputation.
6. Using Domain Knowledge: Expert Input: Seek input from domain experts, Costly
and infeasible.
7. Handling for Categorical Data: Create a Separate Category: For categorical data,
create a separate category for missing values, replace using the most probable
value.
8. Utilizing Software Functions: Built-in Functions: Many programming languages
and statistical software have built-in functions for handling missing data.
16
CSCI322: Data Analysis
Age Gender Height

25 Male 172.72

Noisy Data 32 Female

45 Male
28 Female
37 Male
157.48
180.34
165.1
175.26
22 Female 152.4
29 Male 182.88
40 Female 160.02
33 Male 177.8

• Noise: random error or variance in a measured variable 26 Female

38 Male
162.56
172.72
30 Female 154.94
42 Male 185.42

• Incorrect attribute values may be due to 27 Female

35 Male
167.64
170.18
24 Female 149.86
36 Male 177.8
• faulty data collection instruments 31 Female
44 Male
157.48
180.34
29 Female 165.1

• data entry problems

39 Male 175.26
23 Female 152.4
32 Male 182.88
41 Female 160.02

• data transmission problems 34 Male

25 Female
177.8
162.56
37 Male 172.72
28 Female 154.94

• technology limitation 43 Male

26 Female
185.42
167.64
35 Male 170.18
24 Female 149.86
• inconsistency in naming convention 36 Male
31 Female
177.8
157.48
45 Male 180.34

• Other data problems which require data cleaning

29 Female 165.1
38 Male 175.26
22 Female 152.4
30 Male 182.88

• duplicate records 41 Female

33 Male
160.02
177.8
27 Female 167.64
39 Male 175.26

• incomplete data 23 Female

32 Male
152.4
182.88
43 Female 160.02
34 Male 177.8
• inconsistent data 25 Female
37 Male
162.56
172.72
28 Female 154.94
CSCI322: Data Analysis 150 149.86
Noisy Data
• Binning
• first sort data and partition into (equal-frequency)
bins
• then one can smooth by bin means, smooth by
bin median, smooth by bin boundaries, etc.
• Regression
• smooth by fitting the data into regression
functions
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human
(e.g., deal with possible outliers)
18

CSCI322: Data Analysis

smooth by bin
means for two
different features,
age and height
which helped
noticeably to
reduce the error
caused by the
incorrect or noisy
values
CSCI322: Data Analysis
Data Cleaning as a Process

• Data discrepancy detection violators (e.g., correlation and clustering

• Use metadata (e.g., domain, range, to find outliers)

dependency, distribution) • Data migration and integration

• Check field overloading • Data migration tools: allow

• Check uniqueness rule, consecutive transformations to be specified

rule and null rule • ETL

• Use commercial tools (Extraction/Transformation/Loading)

• Data scrubbing: use simple domain tools: allow users to specify

knowledge (e.g., postal code, spell-check) transformations through a graphical

to detect errors and make corrections user interface

• Integration of the two processes

• Data auditing: by analyzing data to
discover rules and relationship to detect • Iterative and interactive

CSCI322: Data Analysis

Outline
• Data Pre-processing: An Overview
• Data Quality
• Major Tasks in Data Pre-processing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary

21
CSCI322: Data Analysis
Data Integration Challenges

• Data integration:
• Combines data from multiple sources into a coherent storage
• Schema integration: e.g., A.cust-id B.cust-#
• Integrate metadata from different sources
• Entity identification problem:
• Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
• Detecting and resolving data value conflicts
• For the same real world entity, attribute values from different
sources are different
• Possible reasons: different representations, different scales, e.g.,
metric vs. British units

CSCI322: Data Analysis

23
CSCI322: Data Analysis
Handling Redundancy in Data
Integration
• Redundant data occur often when integration of multiple databases
• Object identification: The same attribute or object may have
different names in different databases
• Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue
• Redundant attributes may be able to be detected by correlation
analysis and covariance analysis
• Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining
speed and quality

CSCI322: Data Analysis

Correlation Analysis (Nominal
Data)
• Χ2 (chi-square) test

(Oij#Eij)! 𝑶𝒊. 𝑶.𝒋

𝜒! =∑ where 𝑬𝒊𝒋 =
Eij 𝑺

• The larger the Χ2 value, the more likely the attributes are related
• The cells that contribute the most to the Χ2 value are those whose
actual count is very different from the expected count
• The Χ2 statistic tests the hypothesis that X and Y are independent,
that is, there is no correlation between them
• The test is based on a significance level, with degrees of freedom
• If the hypothesis can be rejected, then we say that X and Y are
statistically correlated

CSCI322: Data Analysis

Correlation Analysis (Nominal
Data)

• A contingency table records the frequencies of different attributes.

• Let Oij denote the joint event that attribute X takes on value Xi and attribute Y takes
on value Yj.
• Each and every possible Oij has its own cell (or slot) in the table.
• Also, we have Oi. = Ʃj Oij and O.j = Ʃi Oij .
• The total number of data tuples S = Ʃij Oij .

CSCI322: Data Analysis

Chi-Square Calculation: An
Example
Play chess Not play chess Sum (row)
• Χ2 calculation Like science fiction 250(90) 200(360) 450
(numbers in Not like science fiction 50(210) 1000(840) 1050
parenthesis are Sum(col.) 300 1200 1500

expected counts)
( 250 - 90 ) 2
(50 - 210) 2
( 200 - 360) 2
(1000 - 840) 2
c2 = + + + = 507.93
90 210 360 840
• The degrees of freedom are (2-1)(2-1) = 1
• The Χ2 value needed to reject the hypothesis at the 0.001 significance
level is 10.83
• It shows that like_science_fiction and play_chess are correlated in the
group

CSCI322: Data Analysis

Correlation Analysis (Nominal
Data)

• To be sure that the Χ2 result gives a real statistically significant

difference, the P-value (significance level) should be looked up.
• The number of degrees of freedom associated with a chi-square test
is equal to the number of categories minus one.
• That is, df = (i-1)×(j−1).

CSCI322: Data Analysis

Correlation Analysis (Nominal
Data)

• Since our computed value (507.93) is above this (10.83), we can

reject the hypothesis that play_chess and like_science_fiction are
independent.
• Also, we conclude that the two attributes are (strongly) correlated for
the given group of people.

• Correlation does not imply causality

• # of hospitals and # of car-theft in a city are correlated
• Both are causally linked to the third variable: population

CSCI322: Data Analysis

Correlation Analysis (Numeric
Data)

• Correlation coefficient (also called Pearson’s product moment

coefficient)

åi=1 (ai - A)(bi - B) å

n n
(ai bi ) - n AB
rA, B = = i =1
(n -n1)s As B (n -n1)s As B

• where n is the number of tuples, A and B are the respective means

of A and B, σA and σB are the respective standard deviation of A and
B, and Σ(aibi) is the sum of the AB cross-product.
• If rA,B > 0, A and B are positively correlated (A’s values increase as
B’s). The higher, the stronger correlation.
• rA,B = 0: independent; rAB < 0: negatively correlated

CSCI322: Data Analysis

Visually Evaluating Correlation

Scatter plots showing

the similarity from –1
to 1.

CSCI322: Data Analysis

Correlation (viewed as linear
relationship)

• Correlation measures the linear relationship between objects

• To compute correlation, we standardize data objects, A and B, and
then take their dot product

a'k = (ak - mean( A)) / std ( A)

b'k = (bk - mean( B)) / std ( B)
correlation( A, B) = A'•B'
CSCI322: Data Analysis
Covariance (Numeric Data)

• Covariance is similar to correlation

• Correlation coefficient:

• where n is the number of tuples, A and B are the respective mean

or expected values of A and B, σA and σB are the respective standard
deviation of A and B.

CSCI322: Data Analysis

Covariance (Numeric Data)

• Positive covariance: If CovA,B > 0, then A and B both tend to be

larger than their expected values.
• Negative covariance: If CovA,B < 0 then if A is larger than its
expected value, B is likely to be smaller than its expected value.
• Independence: CovA,B = 0 but the converse is not true:
• Some pairs of random variables may have a covariance of 0 but
are not independent. Only under some additional assumptions
does a covariance of 0 imply independence

CSCI322: Data Analysis

Covariance (Numeric Data)

• It can be simplified in computation as

• Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
• E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
• Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
• Thus, A and B rise together since Cov(A, B) > 0.

CSCI322: Data Analysis

Data Fusion
• Within big data time, a massive amount of heterogeneous data are
generated, from different sources in different domains, as social media,
transportation, health care, and wireless communication networks.
• These data are varied in modality, representation, and distribution.
• Integrating diverse data with different modalities to develop a more usable
form of information is called multimodal data fusion.
• Multimodal data fusion approaches
Classified into four categories according to the level of fusion
Ø Raw data fusion (early fusion).
Ø Decision fusion (late fusion).
Ø Hybrid fusion.
Ø Features fusion.

CSCI322: Data Analysis

Raw data fusion
Data Fusion
• Is a straightforward approach; it is just a concatenation among diverse
data modalities to be the input into a machine learning algorithm.

Decision fusion
• Is an approach performs the fusion after taking a separate decision on
each modality.
• In this approach, using the same predictive model for all modalities or
different predictive models for each modality.
• Several decision fusion techniques may be used, such as voting
schemes, signal variance, averaging, and weighting based on channel
noise.

CSCI322: Data Analysis

Data Fusion
Hybrid data fusion

• Is a combination of early fusion and late fusion approaches.

• It performs fusion on the decision using one of the decision fusion techniques,
but the input of each predictive model is a concatenation of diverse
modalities, and the predictive models must be different.

Features fusion

• Is an approach exploits the benefits of linear and nonlinear correlations

among the feature modalities to generate a single unified representation of
those modalities.

CSCI322: Data Analysis

CS322 - Lec 3 - S25
No ratings yet
CS322 - Lec 3 - S25
42 pages
2 Data Pre-Processing
No ratings yet
2 Data Pre-Processing
50 pages
3 Processing
No ratings yet
3 Processing
79 pages
Data Preprocessing
No ratings yet
Data Preprocessing
120 pages
DM Preprocessing Lec4,5
No ratings yet
DM Preprocessing Lec4,5
36 pages
Data Preprocessing in Data Mining
No ratings yet
Data Preprocessing in Data Mining
60 pages
Class-Data Preprocessing-II
No ratings yet
Class-Data Preprocessing-II
57 pages
Data Preparation Guide COS10022
No ratings yet
Data Preparation Guide COS10022
61 pages
Unit 2 Preprocessing
No ratings yet
Unit 2 Preprocessing
39 pages
Preprocessing Techniques
No ratings yet
Preprocessing Techniques
63 pages
M2 PPT
No ratings yet
M2 PPT
60 pages
DM Day3 Preprocessing A F24
No ratings yet
DM Day3 Preprocessing A F24
85 pages
Data Preprocessing (Sagar)
No ratings yet
Data Preprocessing (Sagar)
31 pages
Week3 - Data Preprocessing, Extraction and Preparation
No ratings yet
Week3 - Data Preprocessing, Extraction and Preparation
34 pages
TTDS Lecture 2
No ratings yet
TTDS Lecture 2
40 pages
Chapter 3: Data Preprocessing
No ratings yet
Chapter 3: Data Preprocessing
62 pages
Module 2 - Data Preprocessing
No ratings yet
Module 2 - Data Preprocessing
16 pages
Data Science Preprocessing Guide
No ratings yet
Data Science Preprocessing Guide
40 pages
Data Preprocessing Essentials
No ratings yet
Data Preprocessing Essentials
33 pages
Data Preprocessing Techniques Guide
No ratings yet
Data Preprocessing Techniques Guide
8 pages
DSV-S8 Data Cleaning
No ratings yet
DSV-S8 Data Cleaning
34 pages
04 DM BI Data Preprocessing
No ratings yet
04 DM BI Data Preprocessing
93 pages
Lec 3
No ratings yet
Lec 3
31 pages
Data Mining CSE-443: Ayesha Aziz Prova Lecturer, Dept. of CSE CWU
No ratings yet
Data Mining CSE-443: Ayesha Aziz Prova Lecturer, Dept. of CSE CWU
21 pages
03 Preprocessing
No ratings yet
03 Preprocessing
65 pages
Lecture#2 Data Mining MS (DEIM) Spring 2025
No ratings yet
Lecture#2 Data Mining MS (DEIM) Spring 2025
61 pages
Unit 1
No ratings yet
Unit 1
44 pages
Data Preprocessing for Tech Students
No ratings yet
Data Preprocessing for Tech Students
59 pages
Preprocessing
No ratings yet
Preprocessing
50 pages
Data Preprocessing
No ratings yet
Data Preprocessing
67 pages
VIPDMTheory Chapter 3
No ratings yet
VIPDMTheory Chapter 3
87 pages
Lecture Source: Books by Tan, Steinbach, Kumar Han, Kamber & Pei Evans Dinesh Kumar + Experiential Knowledge
No ratings yet
Lecture Source: Books by Tan, Steinbach, Kumar Han, Kamber & Pei Evans Dinesh Kumar + Experiential Knowledge
40 pages
2nd Cleaning
No ratings yet
2nd Cleaning
46 pages
DM Merged
No ratings yet
DM Merged
169 pages
Unit II (DWDM)
No ratings yet
Unit II (DWDM)
19 pages
Week2 2
No ratings yet
Week2 2
25 pages
Chapter 3
No ratings yet
Chapter 3
50 pages
Data Preprocessing for Regression Analysis
No ratings yet
Data Preprocessing for Regression Analysis
56 pages
Chapter 3 - Tagged
No ratings yet
Chapter 3 - Tagged
63 pages
Lecture 02
No ratings yet
Lecture 02
41 pages
ML 4
No ratings yet
ML 4
17 pages
Session 4
No ratings yet
Session 4
40 pages
Chapter 3: Data Preprocessing
No ratings yet
Chapter 3: Data Preprocessing
30 pages
DM Unit 3
No ratings yet
DM Unit 3
15 pages
DWDM Unit 2
No ratings yet
DWDM Unit 2
11 pages
Unit 3
No ratings yet
Unit 3
41 pages
Unit - II
No ratings yet
Unit - II
56 pages
02 Data - Preprocessing - 4,5,6
No ratings yet
02 Data - Preprocessing - 4,5,6
54 pages
Data Preprocessing Techniques Overview
No ratings yet
Data Preprocessing Techniques Overview
65 pages
Unit 3
No ratings yet
Unit 3
164 pages
Datapreparation
No ratings yet
Datapreparation
59 pages
03 Preprocessing
No ratings yet
03 Preprocessing
59 pages
Estimasi Anggaran Biaya Google Adwords Iklan Website
No ratings yet
Estimasi Anggaran Biaya Google Adwords Iklan Website
54 pages
DS Unit 2
No ratings yet
DS Unit 2
42 pages
Pre Processing
No ratings yet
Pre Processing
52 pages
CSCI205 - Introduction to Computer System - Lecture 01
No ratings yet
CSCI205 - Introduction to Computer System - Lecture 01
28 pages
Talk2Care: An LLM-based Voice Assistant For Communication Between Healthcare Providers and Older Adults
No ratings yet
Talk2Care: An LLM-based Voice Assistant For Communication Between Healthcare Providers and Older Adults
35 pages
Applied Cryptography and Network Security: Christina Pöpper Lejla Batina
No ratings yet
Applied Cryptography and Network Security: Christina Pöpper Lejla Batina
476 pages
Introduction To Side-Channel Attacks
100% (1)
Introduction To Side-Channel Attacks
16 pages
Best Practices For Team-Based Development
No ratings yet
Best Practices For Team-Based Development
4 pages
AI Lecture Slides WK03-1
No ratings yet
AI Lecture Slides WK03-1
35 pages
PowerBIEmbeddedConfiguration PDF
No ratings yet
PowerBIEmbeddedConfiguration PDF
98 pages
Build and Run Your First Docker Windows Server Container - Docker Blog
No ratings yet
Build and Run Your First Docker Windows Server Container - Docker Blog
22 pages
IDS Reference Architecture Model 3.0 2019
No ratings yet
IDS Reference Architecture Model 3.0 2019
118 pages
3 Hours / 100 Marks: Seat No
No ratings yet
3 Hours / 100 Marks: Seat No
5 pages
Convenciones sql-1
No ratings yet
Convenciones sql-1
36 pages
Mohamed Abuthahir: QA Automation Expert
No ratings yet
Mohamed Abuthahir: QA Automation Expert
1 page
Data Structures Algorithms
No ratings yet
Data Structures Algorithms
27 pages
Essentials of Pattern Recognition An Accessible Approach 1st Edition Jianxin Wu PDF Download
No ratings yet
Essentials of Pattern Recognition An Accessible Approach 1st Edition Jianxin Wu PDF Download
134 pages
Certificate in Computing Exam 2008
No ratings yet
Certificate in Computing Exam 2008
20 pages
Basic Mathematics DPP
No ratings yet
Basic Mathematics DPP
24 pages
Determinants in Arihant Maths Book
No ratings yet
Determinants in Arihant Maths Book
136 pages
MCA Problem Solving & Programming Exam 2018
No ratings yet
MCA Problem Solving & Programming Exam 2018
25 pages
Class 5 Computer Term 1
No ratings yet
Class 5 Computer Term 1
2 pages
IMC Number Theory Problems and Solutions
No ratings yet
IMC Number Theory Problems and Solutions
3 pages
May 2024 - Paper 2
No ratings yet
May 2024 - Paper 2
20 pages
07 Network Setup B
No ratings yet
07 Network Setup B
37 pages
Bit Bank Mid2
No ratings yet
Bit Bank Mid2
3 pages
DSA456 Midterm Radu PDF
No ratings yet
DSA456 Midterm Radu PDF
10 pages
Gerard Gael Graffino Salmon Clarissa Ortega Victor Abdiel Oscar Becerril
No ratings yet
Gerard Gael Graffino Salmon Clarissa Ortega Victor Abdiel Oscar Becerril
37 pages
Industrial Training Report
No ratings yet
Industrial Training Report
12 pages
Medical Device Software Guidance
No ratings yet
Medical Device Software Guidance
22 pages
MMM-R FCT Man 0722 en-US
No ratings yet
MMM-R FCT Man 0722 en-US
206 pages
Computer Generations and Types Explained
No ratings yet
Computer Generations and Types Explained
11 pages
Jumper Settings 2003 Mode
No ratings yet
Jumper Settings 2003 Mode
9 pages
The Encoding For Thai: Werner Lemberg 2005/07/04
No ratings yet
The Encoding For Thai: Werner Lemberg 2005/07/04
34 pages
User Interface
No ratings yet
User Interface
56 pages
BCUMDKVHGD
No ratings yet
BCUMDKVHGD
1 page
Hotel Marketing Plan by Slidesgo
No ratings yet
Hotel Marketing Plan by Slidesgo
55 pages

CSCI322 - Lecture 2

Uploaded by

CSCI322 - Lecture 2

Uploaded by

CSCI322: Data

CSCI322: Data Analysis

CSCI322: Data Analysis

1 101 John Doe 28 $60,000 1.75 Oct-2023 100.253

Record Employe Last

1 101 John Doe 28 $60,000 1.75 Oct-2023 100.253

Jane UNAVAILABLE 1.63

CSCI322: Data Analysis

• Raw data is the unprocessed information collected directly from

Student Addres Phone nationa Student

Computer 001 20-24 90 85 High

CSCI322: Data Analysis

• Data is not always available

CSCI322: Data Analysis

CSCI322: Data Analysis

Noisy Data 32 Female

• Noise: random error or variance in a measured variable 26 Female

• Incorrect attribute values may be due to 27 Female

• data entry problems

• data transmission problems 34 Male

• technology limitation 43 Male

• Other data problems which require data cleaning

• duplicate records 41 Female

• incomplete data 23 Female

CSCI322: Data Analysis

• Data discrepancy detection violators (e.g., correlation and clustering

dependency, distribution) • Data migration and integration

• Check field overloading • Data migration tools: allow

• Check uniqueness rule, consecutive transformations to be specified

rule and null rule • ETL

• Data scrubbing: use simple domain tools: allow users to specify

knowledge (e.g., postal code, spell-check) transformations through a graphical

to detect errors and make corrections user interface

• Integration of the two processes

CSCI322: Data Analysis

CSCI322: Data Analysis

CSCI322: Data Analysis

(Oij#Eij)! 𝑶𝒊. 𝑶.𝒋

CSCI322: Data Analysis

• A contingency table records the frequencies of different attributes.

CSCI322: Data Analysis

CSCI322: Data Analysis

• To be sure that the Χ2 result gives a real statistically significant

CSCI322: Data Analysis

• Since our computed value (507.93) is above this (10.83), we can

• Correlation does not imply causality

CSCI322: Data Analysis

• Correlation coefficient (also called Pearson’s product moment

åi=1 (ai - A)(bi - B) å

• where n is the number of tuples, A and B are the respective means

CSCI322: Data Analysis

Scatter plots showing

CSCI322: Data Analysis

• Correlation measures the linear relationship between objects

a'k = (ak - mean( A)) / std ( A)

• Covariance is similar to correlation

• where n is the number of tuples, A and B are the respective mean

CSCI322: Data Analysis

• Positive covariance: If CovA,B > 0, then A and B both tend to be

CSCI322: Data Analysis

• It can be simplified in computation as

CSCI322: Data Analysis

CSCI322: Data Analysis

CSCI322: Data Analysis

• Is a combination of early fusion and late fusion approaches.

• Is an approach exploits the benefits of linear and nonlinear correlations

CSCI322: Data Analysis

You might also like