Data Quality and Data Preprocessing
Data preprocessing is one of the most important steps in data mining and machine learning.
Raw data obtained from real-world sources is often incomplete, noisy, inconsistent,
outdated, or difficult to interpret. If such poor-quality data is used directly, the mining results
can be incorrect or misleading. Hence, preprocessing improves data quality and prepares the data
for further analysis.
1. Data Quality: Why Preprocess the Data?
To perform meaningful analysis, we must ensure that the data satisfies several data quality
dimensions. Poor-quality data leads to wrong decisions.
Measures of Data Quality (A Multidimensional View)
1. Accuracy
● Data should be correct, valid, and free from errors.
● Example: Salary = –10 is inaccurate.
2. Completeness
● All required attributes must be recorded.
● Missing values, unrecorded fields reduce completeness.
3. Consistency
● Values should not conflict across the database.
● Example: Age = 42 but Birthdate = 2010 is inconsistent.
4. Timeliness
● Data should be updated at the right time.
● Outdated data reduces reliability.
5. Believability
● Trustworthiness of data sources.
● Data should not contain fake or intentionally incorrect values.
6. Interpretability
● Data should be understandable and properly documented (metadata, units, codes).
2. Major Tasks in Data Preprocessing
Data preprocessing includes several essential steps that convert raw data into a clean,
integrated, and well-structured form.
A. Data Cleaning
This step handles missing data, noisy data, and inconsistent data. It ensures correctness
and completeness.
Main tasks include:
● Filling missing values
● Smoothing noisy data
● Identifying or removing outliers
● Resolving inconsistencies
B. Data Integration
Combining data from multiple sources such as relational databases, files, spreadsheets, or data
cubes. Integration helps remove redundancy and ensures a unified view of the data.
C. Data Reduction
It reduces the volume of data while preserving its analytical integrity.
Includes:
● Dimensionality reduction (feature selection and extraction)
● Numerosity reduction (histograms, sampling, clustering)
● Data compression (lossy/lossless)
D. Data Transformation and Discretization
Converting data into appropriate formats for mining.
Includes:
● Normalization
● Attribute construction
● Aggregation
● Discretization and concept hierarchy generation
3. Data Cleaning (Detailed)
Data cleaning is necessary because real-world data is rarely perfect. Problems arise due to
faulty instruments, typing mistakes, transmission errors, misunderstanding by users, or incorrect
data collection techniques.
1. Incomplete (Missing) Data
Reasons for Missing Data
● Equipment malfunction
● Deleted because of inconsistencies
● Certain values not considered important during data entry
● User forgot to record details
● Historical data not updated or maintained
Handling Missing Data
1. Ignore the tuple
Used when the class label is missing (in classification), but ineffective when many
values are missing.
2. Fill manually
Accurate but time-consuming and impractical for large datasets.
3. Fill with a global constant
Example: Replace with “unknown” — simple but may create a new class.
4. Fill with attribute mean
Suitable for numerical attributes but may reduce data variability (see the sketch after this list).
5. Class-wise mean
Better than global mean because it considers class structure.
6. Most probable value
Using statistical or AI techniques such as Bayesian methods, regression, or decision
trees.
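As a rough illustration, the sketch below applies strategies 1, 4, and 5 from the list above using pandas; the tiny dataset and its column names are invented for the example.

```python
import numpy as np
import pandas as pd

# Invented dataset: a class label and a numeric attribute with missing values.
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50_000, np.nan, 42_000, np.nan, 38_000],
})

# Strategy 1: ignore the tuple (drop rows containing missing values).
dropped = df.dropna()

# Strategy 4: fill with the global attribute mean.
global_mean = df.assign(income=df["income"].fillna(df["income"].mean()))

# Strategy 5: class-wise mean (fill each gap with the mean of its own class).
class_mean = df.assign(
    income=df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))
)

print(global_mean)
print(class_mean)
```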
2. Noisy Data
Noise refers to random errors or variations in data values.
Causes include faulty sensors, typing errors, transmission issues, and format inconsistencies.
Handling Noisy Data
1. Binning
○ Sort data
○ Divide into equal-frequency bins
○ Smooth by mean, median, or bin boundaries (illustrated in the sketch after this list)
2. Regression
○ Fit a regression function to data (linear or polynomial)
○ Smooth values based on predicted trends
3. Clustering
○ Group similar values into clusters; values falling outside the clusters (or in very small clusters) are likely outliers
○ These can be removed or treated separately
4. Human–computer inspection
○ Suspicious or extreme values manually checked and corrected
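A minimal sketch of the binning approach (item 1 above) in NumPy; the values, the choice of three bins, and the two smoothing rules are illustrative assumptions.

```python
import numpy as np

# Invented sorted values and an equal-frequency (equi-depth) split into 3 bins.
data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.array_split(data, 3)

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
smoothed_by_mean = np.concatenate([np.full(len(b), b.mean()) for b in bins])

# Smoothing by bin boundaries: each value moves to the closer bin boundary.
smoothed_by_boundary = np.concatenate([
    np.where(b - b.min() <= b.max() - b, b.min(), b.max()) for b in bins
])

print(smoothed_by_mean)      # bin 1 -> 9.0, bin 2 -> 22.75, bin 3 -> 29.25
print(smoothed_by_boundary)  # e.g. first bin becomes [4, 4, 4, 15]
```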
3. Inconsistent Data
Occurs when:
● Naming conventions differ
● Code systems change over time
● Different versions of records exist
● Duplicate records have conflicting values
Data Cleaning as a Process
● Data discrepancy detection: Using metadata such as domain, ranges, rules
● Field checks: Uniqueness rule, null rule, format rule
● Data scrubbing: Spell-check, postal code validation
● Data auditing: Discover hidden rules using correlation or clustering
● Data migration using ETL tools: For extraction, transformation, loading
● Interactive cleaning tools: Like Potter’s Wheel for iterative correction
4. Data Integration
Data integration is a crucial step in data preprocessing that involves combining data from
multiple sources—such as databases, data warehouses, flat files, sensors, and web logs—into
a single coherent, unified dataset.
Since modern organizations store data across different platforms and formats, integration
ensures that all information is brought together correctly so meaningful analysis and data mining
can be performed.
The goal of data integration is to provide a consistent, accurate, non-redundant view of data
across the entire organization.
However, integration is challenging because data from different sources may have differences in
naming, formats, structures, measurement units, and levels of detail.
Main Issues in Data Integration
1. Schema Integration
Schema integration deals with merging the schemas (structures, tables, fields) of different
databases.
Problems
● Same attribute may have different names in different sources.
Example:
○ A.cust-id
○ B.customer-number
○ C.cid
● Attributes may have different data types (e.g., integer vs string).
● Same concept may be represented differently (e.g., gender: M/F vs 0/1).
Goal
To create a unified schema without ambiguity or naming conflicts.
Techniques used include:
● Schema matching
● Mapping rules
● Standard naming conventions
2. Entity Identification Problem
Entity identification refers to determining whether two records from different sources represent
the same real-world entity.
Examples
● “Bill Clinton”, “William Clinton”, “W. J. Clinton”
● Customer ID in one file vs Email ID in another
● Phone number formats: (555)123-4567 vs 5551234567
Causes
● Different naming conventions
● Abbreviations
● Typographical differences
● Missing unique identifiers
Techniques used:
● Matching algorithms
● String similarity measures (Levenshtein distance)
● Domain knowledge
● Standardization rules
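As an illustration of string-similarity matching, the sketch below implements the standard dynamic-programming Levenshtein distance in plain Python; the names being compared are taken from the example above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Records whose normalized names are "close" may refer to the same entity.
print(levenshtein("William Clinton", "Bill Clinton"))   # small distance
print(levenshtein("William Clinton", "George Bush"))    # much larger distance
```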
3. Detecting and Resolving Data Value Conflicts
Data value conflicts occur when the same attribute has different values in different systems.
Types of Conflicts
a) Different Measurement Units
● Weight stored as kg in one system, lbs in another
● Distance recorded as miles in one source, kilometers in another
b) Different Scales or Ranges
● Temperature in Celsius vs Fahrenheit
● Currency stored in USD vs INR
c) Different Coding Schemes
● Male/Female vs M/F vs 1/0
● Department codes: HR vs Human Resources
d) Different Levels of Granularity
● Monthly sales vs yearly sales
● City vs region level information
Resolving these conflicts requires:
● Standardization
● Unit conversion
● Common coding systems
● Data transformation rules
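A minimal sketch of conflict resolution through unit conversion and a common coding scheme; the record layout and the mapping of 1/0 to M/F are illustrative assumptions.

```python
# Convert weights to a common unit (kg) and gender codes to a common scheme.
LB_PER_KG = 2.20462

def to_kg(value: float, unit: str) -> float:
    """Normalize a weight to kilograms."""
    return value / LB_PER_KG if unit.lower() in ("lb", "lbs") else value

# Assumed code table: which raw codes map to M and which to F is illustrative.
GENDER_CODES = {"m": "M", "male": "M", "1": "M",
                "f": "F", "female": "F", "0": "F"}

records = [
    {"weight": 154.0, "unit": "lbs", "gender": "male"},
    {"weight": 70.0,  "unit": "kg",  "gender": "0"},
]

for r in records:
    r["weight_kg"] = round(to_kg(r["weight"], r["unit"]), 1)
    r["gender"] = GENDER_CODES[r["gender"].lower()]

print(records)
```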
4. Handling Redundancy
When integrating multiple databases, redundancy is very common.
Types of Redundancy
a) Duplicate Attributes
Same information stored under different names in multiple tables.
Example:
● cust_id
● customer_id
● id
These need to be merged into one unique attribute.
b) Derivable Attributes
One attribute is computable from another.
Example:
● Annual revenue can be derived from monthly revenue
● Age can be calculated from birthdate
If such attributes are retained without checking, they cause:
● Inconsistency
● Extra storage
● Slower processing
Thus, redundancy must be detected and removed.
Tools for Redundancy Detection
Data integration uses statistical tools like:
1. Correlation Analysis
Checks if two attributes are strongly related.
High correlation often indicates redundancy (see the sketch after this list).
2. Covariance Analysis
Measures how two attributes vary together.
A high covariance may suggest duplication or dependency.
3. Metadata Analysis
Uses data descriptions, units, types, and constraints to detect redundant or overlapping data.
4. Domain Knowledge
Human experts help understand which attributes are essential and which can be removed.
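A minimal sketch of correlation-based redundancy detection (tool 1 above) using pandas; the synthetic dataset and the 0.9 threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
monthly = rng.uniform(10, 100, size=200)
df = pd.DataFrame({
    "monthly_revenue": monthly,
    "annual_revenue": monthly * 12,             # derivable, hence redundant
    "num_employees": rng.integers(1, 50, 200),
})

corr = df.corr().abs()
threshold = 0.9
# Flag attribute pairs whose absolute correlation exceeds the threshold.
redundant = [(a, b) for a in corr.columns for b in corr.columns
             if a < b and corr.loc[a, b] > threshold]
print(redundant)   # [('annual_revenue', 'monthly_revenue')]
```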
Importance of Proper Data Integration
● Avoids duplicates and inconsistencies
● Ensures reliable and trusted data
● Reduces storage and speeds up mining
● Produces meaningful and holistic insights
● Essential for building integrated data warehouses and OLAP systems
5. Data Reduction
Data reduction is an important step in data preprocessing that aims to obtain a smaller, more
compact representation of the dataset without losing significant information.
In real-world applications, data warehouses often contain gigabytes or terabytes of data.
Running complex data mining algorithms on such huge datasets becomes slow, costly, and
inefficient.
Data reduction techniques make analysis practical and faster by reducing volume while
preserving the essential information.
Why Data Reduction?
Data reduction is necessary because:
1. Faster Data Mining
Algorithms run significantly faster on reduced datasets.
2. Less Storage
Compact representations save disk space and memory.
3. Improved Algorithm Performance
Many mining methods (clustering, classification, neural networks) work better when the dataset
is smaller and more manageable.
4. Easier Visualization
High-dimensional data is difficult to visualize; reduction provides clearer insights.
A. Dimensionality Reduction
Dimensionality reduction deals with reducing the number of attributes or features.
In many machine learning tasks, datasets have dozens or hundreds of features, many of which
may be irrelevant, redundant, or noisy.
Problems Caused by Too Many Features
● Difficult to visualize or interpret
● Computation becomes expensive
● Leads to the “curse of dimensionality”
● Many features may be correlated or useless
Dimensionality reduction techniques help simplify the model.
1. Feature Selection
Feature selection chooses a subset of the original features that are most relevant.
Instead of transforming features, it eliminates unnecessary ones.
Methods of Feature Selection
a) Filter Methods
● Use statistical tests such as correlation, chi-square, mutual information.
● Independent of learning algorithms.
b) Wrapper Methods
● Evaluate subsets by running a model (e.g., accuracy-based selection).
● More accurate but computationally expensive.
c) Embedded Methods
● Feature selection happens inside the model.
● Example: Decision trees, LASSO regression.
Feature selection reduces dimensionality without transforming the data, making it easier to
understand.
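A minimal sketch of a filter-style feature selection: rank features by their absolute correlation with the target and keep the top k. The synthetic data, column names, and k are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "hours_studied": rng.uniform(0, 10, n),
    "noise_feature": rng.normal(size=n),        # unrelated to the target
    "attendance":    rng.uniform(0, 10, n),
})
target = 3 * df["hours_studied"] + 2 * df["attendance"] + rng.normal(size=n)

# Filter method: score each feature independently of any learning algorithm.
k = 2
scores = df.corrwith(target).abs().sort_values(ascending=False)
selected = scores.head(k).index.tolist()
print(scores)
print("Selected features:", selected)   # likely ['hours_studied', 'attendance']
```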
2. Feature Extraction
Unlike feature selection, feature extraction transforms the original attributes into a new, smaller
set of features.
Popular techniques:
a) PCA (Principal Component Analysis)
● Creates new orthogonal components.
● Captures maximum variance using fewer dimensions.
● Very effective when many features are correlated.
Feature extraction changes the data representation but preserves important patterns.
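A minimal sketch of PCA-based feature extraction using scikit-learn (one common implementation); the synthetic data and the choice of two components are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
x = rng.normal(size=(200, 1))
# Five correlated features built mostly from one underlying factor.
X = np.hstack([x, 2 * x + 0.1 * rng.normal(size=(200, 1)),
               -x, rng.normal(size=(200, 1)), 0.5 * x])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # 200 x 2 instead of 200 x 5
print(X_reduced.shape)
print(pca.explained_variance_ratio_)    # most variance in the first component
```

The explained variance ratio shows how much of the original variance each component retains, which guides how many components to keep.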
Advantages of Dimensionality Reduction
1. Requires less storage
2. Faster computation and model training
3. Helps remove redundant and irrelevant features
4. Reduces noise and overfitting
5. Improves visualization
Disadvantages
1. Possible loss of information
2. PCA assumes linear relationships among variables
3. Some techniques are not easily interpretable
4. Difficult to choose the correct number of components
B. Numerosity Reduction
Numerosity reduction reduces data volume by replacing the original data with a smaller, alternative representation.
Instead of keeping all the data, we store a compact model or summary of the data.
1. Parametric Methods
These methods assume that data follows a particular model and store only model parameters,
not the actual data.
Examples:
● Regression Models
Fit data using linear or nonlinear regression; only coefficients are stored.
● Log-Linear Models
Useful for multidimensional data; store parameters instead of full data cubes.
Advantages:
● Highly compressed
● Easy to store and compute
Disadvantages:
● Accuracy depends on model fit
2. Non-Parametric Methods
Do not assume any predefined model.
Examples:
● Histograms
Group data into bins and store only the bin boundaries and per-bin counts (see the sketch after this list).
● Clustering
Represent data by cluster centers; outliers removed or grouped separately.
● Sampling
Select a representative sample instead of the full dataset.
Advantages:
● Flexible and simple
● Effective for large datasets
Disadvantages:
● May miss rare but important patterns
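A minimal sketch of two non-parametric reductions, an equal-width histogram and simple random sampling, using NumPy; the data, bin count, and sample fraction are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)
values = rng.normal(loc=50, scale=10, size=100_000)

# Histogram: keep 20 bin boundaries plus counts instead of 100,000 raw values.
counts, edges = np.histogram(values, bins=20)

# Sampling: keep a 1% simple random sample without replacement.
sample = rng.choice(values, size=len(values) // 100, replace=False)

print(counts.sum(), len(edges))   # 100000 21
print(len(sample))                # 1000
```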
C. Data Cube Aggregation
Data cube aggregation reduces data by summarizing it at higher granularity.
For example:
● Daily sales → weekly or monthly sales
● City-level data → region-level data
This reduces the number of rows while retaining overall trends.
Advantages:
● Faster OLAP queries
● Useful for dashboards and business analytics
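A minimal sketch of cube-style aggregation with pandas, rolling daily city-level sales up to monthly region-level totals; all names and values are invented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
days = pd.date_range("2024-01-01", periods=90, freq="D")
sales = pd.DataFrame({
    "date":   np.tile(days, 2),
    "city":   ["Pune"] * 90 + ["Delhi"] * 90,
    "amount": rng.integers(100, 1000, size=180),
})
sales["region"] = sales["city"].map({"Pune": "West", "Delhi": "North"})

# Roll up: daily, city-level rows -> monthly, region-level totals.
monthly_region = (
    sales.groupby(["region", pd.Grouper(key="date", freq="MS")])["amount"]
         .sum()
         .reset_index()
)
print(monthly_region)   # 6 summary rows instead of 180 detail rows
```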
D. Data Compression
Data compression reduces storage requirements with or without losing information.
1. Lossless Compression
● No information is lost
● Original data can be perfectly reconstructed
Examples: Huffman coding, run-length encoding (RLE)
2. Lossy Compression
● Some information is lost
● Acceptable when exact accuracy is not needed
Examples: JPEG, audio compression, dimensionality reduction techniques
Lossy methods provide very high compression but may slightly reduce accuracy.
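A minimal sketch of lossless compression using run-length encoding (RLE) in plain Python, showing that the original sequence is reconstructed exactly.

```python
from itertools import groupby

def rle_encode(seq):
    """Store (value, run length) pairs instead of every repeated value."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(seq)]

def rle_decode(pairs):
    """Rebuild the original sequence from the (value, count) pairs."""
    return [value for value, count in pairs for _ in range(count)]

data = "AAAABBBCCDAAA"
encoded = rle_encode(data)
print(encoded)   # [('A', 4), ('B', 3), ('C', 2), ('D', 1), ('A', 3)]
assert "".join(rle_decode(encoded)) == data   # perfectly reconstructed
```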
6. Data Transformation
Data transformation is a key step in data preprocessing. It involves converting data into a
suitable format or structure that makes mining, analysis, and modeling more effective. Since raw
data may exist in different scales, formats, or levels of detail, transformation helps standardize
and prepare data for algorithms such as classification, clustering, or association rule mining.
The main goal of data transformation is to improve the quality of data representation,
enhance accuracy, and reduce complexity without altering the essential meaning.
What is Data Transformation?
Data transformation is a process that maps the entire set of values of an attribute to a new set of
values through mathematical, logical, or statistical operations, so that each original value can be
identified with one of the new values.
Transformation improves:
● Data quality
● Uniformity
● Scalability
● Interpretability
● Mining efficiency
Major Methods of Data Transformation
1. Smoothing
Smoothing techniques are used to remove noise (random errors or fluctuations) and make data
more consistent.
Methods of Smoothing:
● Binning:
Sort data and divide into bins; smooth using mean, median, or boundaries.
● Regression smoothing:
Fit the data into a regression function (linear or nonlinear).
● Moving averages:
Use average of recent values to reduce fluctuations.
Smoothing is commonly used in time-series data, sensor data, and noisy attributes.
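A minimal sketch of moving-average smoothing on a noisy series using pandas; the synthetic series and the window size of 7 are illustrative choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
t = np.arange(100)
# A smooth trend plus random noise.
noisy = pd.Series(np.sin(t / 10) + rng.normal(scale=0.3, size=100))

# Centered 7-point moving average; endpoints with incomplete windows become NaN.
smoothed = noisy.rolling(window=7, center=True).mean()

print(noisy.head(10).round(2).tolist())
print(smoothed.head(10).round(2).tolist())
```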
2. Attribute/Feature Construction
This involves creating new attributes from existing ones to make patterns easier to discover.
Constructed features often capture important relationships or domain knowledge better.
Examples:
● From birthdate, compute age.
● Combine height and weight to create BMI.
● From transaction timestamps, extract day of week or hour.
Benefits:
● Improves model accuracy
● Makes hidden patterns visible
● Simplifies complex relationships
This is also known as feature generation.
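A minimal sketch of attribute construction with pandas, deriving age, BMI, and day of week from existing attributes; the column names and values are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "birthdate": pd.to_datetime(["1990-05-01", "2001-11-23"]),
    "height_m":  [1.75, 1.62],
    "weight_kg": [70.0, 55.0],
    "txn_time":  pd.to_datetime(["2024-03-15 09:30", "2024-03-16 18:05"]),
})

today = pd.Timestamp("2024-06-01")                      # fixed date for reproducibility
df["age"] = (today - df["birthdate"]).dt.days // 365    # approximate age in years
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
df["txn_day_of_week"] = df["txn_time"].dt.day_name()

print(df[["age", "bmi", "txn_day_of_week"]])
```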
3. Aggregation
Aggregation summarizes or combines two or more attributes or tuples to create new,
higher-level information.
Examples:
● Daily sales → weekly → monthly sales
● Summarizing customer transactions into total monthly spending
● Using count, sum, average on large datasets
Aggregation reduces data size (helpful for data reduction) and reveals higher-level trends used
in OLAP and data cube construction.
4. Normalization
Normalization is the process of scaling numerical data into a small, standard range, commonly
useful for algorithms that are distance-based (e.g., KNN, K-means) or gradient-based.
Types of Normalization:
(a) Min–Max Normalization
Maps values to a range [0,1] or any fixed range.
v' = (v − min) / (max − min)
(b) Z-Score Normalization (Standardization)
Maps data to mean 0 and standard deviation 1.
v' = (v − μ) / σ
(c) Decimal Scaling Normalization
Moves decimal point based on maximum absolute value.
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
Normalization improves algorithm performance by reducing the impact of varying scales.
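A minimal sketch of the three normalization formulas above in NumPy; the sample values are purely illustrative.

```python
import numpy as np

v = np.array([-986.0, 200.0, 300.0, 400.0, 917.0])

# (a) Min-max normalization to [0, 1]
min_max = (v - v.min()) / (v.max() - v.min())

# (b) Z-score normalization (mean 0, standard deviation 1)
z_score = (v - v.mean()) / v.std()

# (c) Decimal scaling: smallest j such that max(|v'|) < 1 (here j = 3)
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
decimal_scaled = v / 10 ** j

print(min_max.round(3))
print(z_score.round(3))
print(decimal_scaled)
```

Note that min–max normalization is sensitive to outliers (they define the min and max), whereas z-score normalization is usually preferred when extreme values are present.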
5. Discretization
Discretization converts continuous numeric attributes into categorical or interval-based values
(bins).
It reduces the number of attribute values and simplifies analysis.
Examples:
● Age → {young, adult, senior}
● Income → {low, medium, high}
Methods:
● Binning
● Entropy-based splitting
● Equal-width or equal-frequency partitioning
Discretization is also the first step toward concept hierarchy generation.
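A minimal sketch of discretization with pandas, using explicit cut points for age and equal-frequency (quantile) bins for income; the cut points, labels, and bin counts are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
df = pd.DataFrame({
    "age":    rng.integers(18, 90, size=100),
    "income": rng.normal(50_000, 15_000, size=100).round(),
})

# Interval binning with explicit cut points and category labels.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 60, 120],
                         labels=["young", "adult", "senior"])

# Equal-frequency (quantile) bins: roughly the same number of rows per bin.
df["income_level"] = pd.qcut(df["income"], q=3, labels=["low", "medium", "high"])

print(df["age_group"].value_counts())
print(df["income_level"].value_counts())
```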
6. Concept Hierarchy Generation
Concept hierarchies allow data to be viewed at different levels of abstraction.
They generalize low-level (detailed) data into high-level (summarized) concepts.
Examples:
Location Hierarchy:
City → State → Country → Region
Product Hierarchy:
Item → Category → Department → All Products
Time Hierarchy:
Day → Month → Quarter → Year
Concept hierarchies enrich OLAP operations such as roll-up and drill-down.
Why Data Transformation Is Important
● Helps machine learning algorithms understand data better
● Reduces complexity and dimensionality
● Removes noise and makes patterns clearer
● Ensures uniform scaling of attributes
● Enables better performance for distance-based models
● Provides meaningful summaries for business decisions
Without transformation, many algorithms would perform poorly or produce misleading results.