DM Data Preprocessing

Data preprocessing is crucial in data mining and machine learning to enhance data quality by addressing issues such as incompleteness, noise, and inconsistency. Key tasks include data cleaning, integration, reduction, and transformation, each aimed at ensuring accurate and meaningful analysis. Proper preprocessing leads to better decision-making and insights by providing a clean, structured dataset for further analysis.

Data Quality and Data Preprocessing

Data preprocessing is one of the most important steps in data mining and machine learning.
Raw data obtained from real-world sources is often incomplete, noisy, inconsistent,
outdated, or difficult to interpret. If we use such poor-quality data directly, the mining results
become incorrect or misleading. Hence preprocessing improves data quality and prepares data
for further analysis.

1. Data Quality: Why Preprocess the Data?


To perform meaningful analysis, we must ensure that the data satisfies several data quality
dimensions. Poor-quality data leads to wrong decisions.

Measures of Data Quality (A Multidimensional View)

1. Accuracy

●​ Data should be correct, valid, and free from errors.​

●​ Example: Salary = –10 is inaccurate.​

2. Completeness

●​ All required attributes must be recorded.​

●​ Missing values, unrecorded fields reduce completeness.​

3. Consistency

●​ Values should not conflict across the database.​

●​ Example: Age = 42 but Birthdate = 2010 is inconsistent.​

4. Timeliness
●​ Data should be updated at the right time.​

●​ Outdated data reduces reliability.​

5. Believability

●​ Trustworthiness of data sources.​

●​ Data should not contain fake or intentionally incorrect values.​

6. Interpretability

●​ Data should be understandable and properly documented (metadata, units, codes)

2. Major Tasks in Data Preprocessing


Data preprocessing includes several essential steps that convert raw data into a clean,
integrated, and well-structured form.

A. Data Cleaning
This step handles missing data, noisy data, and inconsistent data. It ensures correctness
and completeness.​
Main tasks include:

●​ Filling missing values​

●​ Smoothing noisy data​

●​ Identifying or removing outliers​

●​ Resolving inconsistencies​

B. Data Integration
Combining data from multiple sources such as relational databases, files, spreadsheets, or data
cubes. Integration helps remove redundancy and ensures a unified view of the data.

C. Data Reduction
It reduces the volume of data while preserving its analytical integrity.​
Includes:

●​ Dimensionality reduction (feature selection and extraction)​

●​ Numerosity reduction (histograms, sampling, clustering)​

●​ Data compression (lossy/lossless)​

D. Data Transformation and Discretization


Converting data into appropriate formats for mining.​
Includes:

●​ Normalization​

●​ Attribute construction​

●​ Aggregation​

●​ Discretization and concept hierarchy generation

3. Data Cleaning (Detailed)


Data cleaning is necessary because real-world data is rarely perfect. Problems arise due to
faulty instruments, typing mistakes, transmission errors, misunderstanding by users, or incorrect
data collection techniques.

1. Incomplete (Missing) Data


Reasons for Missing Data

●​ Equipment malfunction​

●​ Deleted because of inconsistencies​

●​ Certain values not considered important during data entry​


●​ User forgot to record details​

●​ Historical data not updated or maintained​

Handling Missing Data

1.​ Ignore the tuple​


Used when the class label is missing (in classification), but ineffective when many
values are missing.​

2.​ Fill manually​


Accurate but time-consuming and impractical for large datasets.​

3.​ Fill with a global constant​


Example: Replace with “unknown” — simple but may create a new class.​

4.​ Fill with attribute mean​


Suitable for numerical attributes but may reduce data variability.​

5.​ Class-wise mean​


Better than global mean because it considers class structure.​

6.​ Most probable value​


Using statistical or AI techniques such as Bayesian methods, regression, or decision
trees.​
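
As a rough illustration of options 1 and 3–5 above, the following sketch uses pandas on a small hypothetical table (column names `class`, `age`, `city` are invented for the example):

```python
import pandas as pd
import numpy as np

# Hypothetical toy data with missing values (names are illustrative only)
df = pd.DataFrame({
    "class": ["A", "A", "B", "B", "B"],
    "age":   [25, np.nan, 40, 38, np.nan],
    "city":  ["Pune", None, "Mumbai", "Pune", "Delhi"],
})

# 1. Ignore the tuple: drop rows whose class label is missing
df_dropped = df.dropna(subset=["class"])

# 3. Fill a categorical attribute with a global constant
df["city"] = df["city"].fillna("unknown")

# 4. Fill a numeric attribute with the overall attribute mean
df["age_global_mean"] = df["age"].fillna(df["age"].mean())

# 5. Class-wise mean: fill using the mean of the same class only
df["age_class_mean"] = df["age"].fillna(
    df.groupby("class")["age"].transform("mean")
)
print(df)
```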

2. Noisy Data
Noise refers to random errors or variations in data values.​
Causes include faulty sensors, typing errors, transmission issues, and format inconsistencies.

Handling Noisy Data

1.​ Binning​

○​ Sort data​

○​ Divide into equal-frequency bins​


○​ Smooth by mean, median, or bin boundaries​

2.​ Regression​

○​ Fit a regression function to data (linear or polynomial)​

○​ Smooth values based on predicted trends​

3.​ Clustering​

○​ Group values into clusters; points that fall outside the clusters (or form very small clusters) are likely outliers​

○​ These can be removed or treated separately​

4.​ Human–computer inspection​

○​ Suspicious or extreme values manually checked and corrected​
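
A minimal sketch of method 1 above (equal-frequency binning with smoothing by bin means or by the nearer bin boundary), using pandas; the toy values are invented:

```python
import pandas as pd

# Sorted toy values from a hypothetical noisy attribute
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Divide into 3 equal-frequency bins
bins = pd.qcut(prices, q=3, labels=False)

# Smooth by bin means: every value is replaced by the mean of its bin
smoothed_by_mean = prices.groupby(bins).transform("mean")

# Smooth by bin boundaries: replace each value by the closer bin edge
bin_min = prices.groupby(bins).transform("min")
bin_max = prices.groupby(bins).transform("max")
smoothed_by_boundary = bin_min.where(
    (prices - bin_min) <= (bin_max - prices), bin_max
)

print(pd.DataFrame({"value": prices,
                    "bin": bins,
                    "mean_smoothed": smoothed_by_mean,
                    "boundary_smoothed": smoothed_by_boundary}))
```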

3. Inconsistent Data
Occurs when:

●​ Naming conventions differ​

●​ Code systems change over time​

●​ Different versions of records exist​

●​ Duplicate records have conflicting values​

Data Cleaning as a Process

●​ Data discrepancy detection: Using metadata such as domain, ranges, rules​

●​ Field checks: Uniqueness rule, null rule, format rule​

●​ Data scrubbing: Spell-check, postal code validation​

●​ Data auditing: Discover hidden rules using correlation or clustering​


●​ Data migration using ETL tools: For extraction, transformation, loading​

●​ Interactive cleaning tools: Like Potter’s Wheel for iterative correction​

4. Data Integration
Data integration is a crucial step in data preprocessing that involves combining data from
multiple sources—such as databases, data warehouses, flat files, sensors, and web logs—into
a single coherent, unified dataset.​
Since modern organizations store data across different platforms and formats, integration
ensures that all information is brought together correctly so meaningful analysis and data mining
can be performed.

The goal of data integration is to provide a consistent, accurate, non-redundant view of data
across the entire organization.

However, integration is challenging because data from different sources may have differences in
naming, formats, structures, measurement units, and levels of detail.

Main Issues in Data Integration

1. Schema Integration
Schema integration deals with merging the schemas (structures, tables, fields) of different
databases.

Problems

●​ Same attribute may have different names in different sources.​


Example:​

○​ A.cust-id​

○​ B.customer-number​

○​ C.cid​

●​ Attributes may have different data types (e.g., integer vs string).​


●​ Same concept may be represented differently (e.g., gender: M/F vs 0/1).​

Goal

To create a unified schema without ambiguity or naming conflicts.​


Techniques used include:

●​ Schema matching​

●​ Mapping rules​

●​ Standard naming conventions​

2. Entity Identification Problem


Entity identification refers to determining whether two records from different sources represent
the same real-world entity.

Examples

●​ “Bill Clinton”, “William Clinton”, “W. J. Clinton”​

●​ Customer ID in one file vs Email ID in another​

●​ Phone number formats: (555)123-4567 vs 5551234567​

Causes

●​ Different naming conventions​

●​ Abbreviations​

●​ Typographical differences​

●​ Missing unique identifiers​

Techniques used:
●​ Matching algorithms​

●​ String similarity measures (Levenshtein distance)​

●​ Domain knowledge​

●​ Standardization rules​
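
The sketch below illustrates string-similarity matching for entity identification. The Levenshtein distance is implemented directly (rather than assuming any particular third-party package), and the 0.8 threshold and the names compared are purely illustrative:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def same_entity(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two strings as the same entity if they are 'close enough'."""
    a, b = a.lower().strip(), b.lower().strip()
    distance = levenshtein(a, b)
    similarity = 1 - distance / max(len(a), len(b), 1)
    return similarity >= threshold

print(same_entity("William Clinton", "Willam Clinton"))  # True (typo)
print(same_entity("Bill Clinton", "W. J. Clinton"))      # False (needs domain rules)
```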

3. Detecting and Resolving Data Value Conflicts


Data value conflicts occur when the same attribute has different values in different systems.

Types of Conflicts

a) Different Measurement Units

●​ Weight stored as kg in one system, lbs in another​

●​ Distance recorded as miles in one source, kilometers in another​

b) Different Scales or Ranges

●​ Temperature in Celsius vs Fahrenheit​

●​ Currency stored in USD vs INR​

c) Different Coding Schemes

●​ Male/Female vs M/F vs 1/0​

●​ Department codes: HR vs Human Resources​

d) Different Levels of Granularity

●​ Monthly sales vs yearly sales​

●​ City vs region level information​


Resolving these conflicts requires:

●​ Standardization​

●​ Unit conversion​

●​ Common coding systems​

●​ Data transformation rules​
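
One possible way to apply such rules during integration, sketched with pandas; the two source tables, their column names, and the gender code mapping are invented for illustration:

```python
import pandas as pd

# Source A stores weight in kg and gender as M/F
a = pd.DataFrame({"cust_id": [1, 2], "weight": [70.0, 55.0], "gender": ["M", "F"]})
# Source B stores weight in lbs and gender as 0/1
b = pd.DataFrame({"cust_id": [3, 4], "weight": [154.0, 132.0], "gender": [0, 1]})

# Unit conversion: bring source B to kilograms
b["weight"] = b["weight"] * 0.453592

# Common coding scheme: map both gender encodings to one vocabulary
b["gender"] = b["gender"].map({0: "M", 1: "F"})

# After standardization the two sources can be combined safely
unified = pd.concat([a, b], ignore_index=True)
print(unified)
```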

4. Handling Redundancy
When integrating multiple databases, redundancy is very common.

Types of Redundancy

a) Duplicate Attributes

Same information stored under different names in multiple tables.​


Example:

●​ cust_id​

●​ customer_id​

●​ id​

These need to be merged into one unique attribute.

b) Derivable Attributes

One attribute is computable from another.

Example:

●​ Annual revenue can be derived from monthly revenue​

●​ Age can be calculated from birthdate​

If such attributes are retained without checking, they cause:


●​ Inconsistency​

●​ Extra storage​

●​ Slower processing​

Thus, redundancy must be detected and removed.

Tools for Redundancy Detection


Data integration uses statistical tools like:

1. Correlation Analysis

Checks if two attributes are strongly related.​


High correlation often indicates redundancy.

2. Covariance Analysis

Measures how two attributes vary together.​


A high covariance may suggest duplication or dependency.
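
A small pandas sketch of how correlation (and covariance) can flag candidate redundant attribute pairs; the column names and the 0.95 threshold are illustrative choices, not fixed rules:

```python
import pandas as pd

df = pd.DataFrame({
    "monthly_revenue": [10, 12, 15, 9, 14],
    "annual_revenue":  [120, 144, 180, 108, 168],   # derivable: 12 x monthly
    "num_employees":   [3, 4, 6, 2, 5],
})

corr = df.corr()   # Pearson correlation (scale-free, easy to threshold)
cov = df.cov()     # covariance (scale-dependent)
print(cov.round(1))

# Flag attribute pairs with very high absolute correlation as redundancy candidates
threshold = 0.95
for i, col_a in enumerate(corr.columns):
    for col_b in corr.columns[i + 1:]:
        if abs(corr.loc[col_a, col_b]) > threshold:
            print(f"Possible redundancy: {col_a} vs {col_b} "
                  f"(corr={corr.loc[col_a, col_b]:.2f})")
```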

3. Metadata Analysis

Uses data descriptions, units, types, and constraints to detect redundant or overlapping data.

4. Domain Knowledge

Human experts help understand which attributes are essential and which can be removed.

Importance of Proper Data Integration


●​ Avoids duplicates and inconsistencies​

●​ Ensures reliable and trusted data​

●​ Reduces storage and speeds up mining​


●​ Produces meaningful and holistic insights​

●​ Essential for building integrated data warehouses and OLAP systems

5. Data Reduction
Data reduction is an important step in data preprocessing that aims to obtain a smaller, more
compact representation of the dataset without losing significant information.​
In real-world applications, data warehouses often contain gigabytes or terabytes of data.​
Running complex data mining algorithms on such huge datasets becomes slow, costly, and
inefficient.​
Data reduction techniques make analysis practical and faster by reducing volume but
preserving meaning.

Why Data Reduction?


Data reduction is necessary because:

1. Faster Data Mining

Algorithms run significantly faster on reduced datasets.

2. Less Storage

Compact representations save disk space and memory.

3. Improved Algorithm Performance

Many mining methods (clustering, classification, neural networks) work better when the dataset
is smaller and more manageable.

4. Easier visualization

High-dimensional data is difficult to visualize; reduction provides clearer insights.

A. Dimensionality Reduction
Dimensionality reduction deals with reducing the number of attributes or features.​
In many machine learning tasks, datasets have dozens or hundreds of features, many of which
may be irrelevant, redundant, or noisy.

Problems Caused by Too Many Features

●​ Difficult to visualize or interpret​

●​ Computation becomes expensive​

●​ Leads to “curse of dimensionality”​

●​ Many features may be correlated or useless​

Dimensionality reduction techniques help simplify the model.

1. Feature Selection
Feature selection chooses a subset of the original features that are most relevant.​
Instead of transforming features, it eliminates unnecessary ones.

Methods of Feature Selection

a) Filter Methods

●​ Use statistical tests such as correlation, chi-square, mutual information.​

●​ Independent of learning algorithms.​

b) Wrapper Methods

●​ Evaluate subsets by running a model (e.g., accuracy-based selection).​

●​ More accurate but computationally expensive.​

c) Embedded Methods

●​ Feature selection happens inside the model.​


●​ Example: Decision trees, LASSO regression.​

Feature selection reduces dimensionality without transforming the data, making it easier to
understand.
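
A minimal scikit-learn sketch of a filter method: chi-square scores select the 2 most relevant features of the Iris dataset. This is just one possible setup for filter-based selection:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest chi-square score w.r.t. the class label
selector = SelectKBest(score_func=chi2, k=2)
X_reduced = selector.fit_transform(X, y)

print(selector.get_support())  # boolean mask of the kept features
print(X_reduced.shape)         # (150, 2)
```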

2. Feature Extraction
Unlike feature selection, feature extraction transforms the original attributes into a new, smaller
set of features.

Popular techniques:

a) PCA (Principal Component Analysis)

●​ Creates new orthogonal components.​

●​ Captures maximum variance using fewer dimensions.​

●​ Very effective when many features are correlated.​

Feature extraction changes the data representation but preserves important patterns.
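
A short scikit-learn sketch of PCA on synthetic, highly correlated data; keeping enough components to explain 95% of the variance is an illustrative choice:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 5 features that are mostly linear mixes of 2 signals
base = rng.normal(size=(200, 2))
X = np.hstack([base,
               base @ rng.normal(size=(2, 3)) + 0.05 * rng.normal(size=(200, 3))])

# Keep enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print("original shape:", X.shape)          # (200, 5)
print("reduced shape:", X_reduced.shape)   # typically (200, 2) for this data
print("explained variance ratio:", pca.explained_variance_ratio_)
```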

Advantages of Dimensionality Reduction


1.​ Requires less storage​

2.​ Faster computation and model training​

3.​ Helps remove redundant and irrelevant features​

4.​ Reduces noise and overfitting​

5.​ Improves visualization​


Disadvantages
1.​ Possible loss of information​

2.​ PCA assumes linear relationships among variables​

3.​ Some techniques are not easily interpretable​

4.​ Difficult to choose the correct number of components​

B. Numerosity Reduction
Numerosity reduction reduces data volume by using fewer representations of the original data.​
Instead of keeping all data, we store a compressed model of the data.

1. Parametric Methods
These methods assume that data follows a particular model and store only model parameters,
not the actual data.

Examples:

●​ Regression Models​
Fit data using linear or nonlinear regression; only coefficients are stored.​

●​ Log-Linear Models​
Useful for multidimensional data; store parameters instead of full data cubes.​

Advantages:

●​ Highly compressed​

●​ Easy to store and compute​

Disadvantages:
●​ Accuracy depends on model fit​
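
A small sketch of parametric reduction with a linear model: instead of storing 10,000 synthetic points, only the fitted slope and intercept are kept (numpy's `polyfit` is used here; any regression routine would do):

```python
import numpy as np

# Synthetic data: 10,000 (x, y) points that follow a roughly linear trend
rng = np.random.default_rng(1)
x = rng.uniform(0, 100, size=10_000)
y = 3.2 * x + 7.0 + rng.normal(scale=2.0, size=10_000)

# Parametric reduction: store only slope and intercept, not the 10,000 points
slope, intercept = np.polyfit(x, y, deg=1)
print(f"store just 2 numbers: slope={slope:.2f}, intercept={intercept:.2f}")

# Any value can later be approximated from the model
x_new = 42.0
print("approximate y at x=42:", slope * x_new + intercept)
```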

2. Non-Parametric Methods
Do not assume any predefined model.

Examples:

●​ Histograms​
Group data into bins and store only the bin boundaries and per-bin counts (or averages).​

●​ Clustering​
Represent data by cluster centers; outliers removed or grouped separately.​

●​ Sampling​
Select a representative sample instead of the full dataset.​

Advantages:

●​ Flexible and simple​

●​ Effective for large datasets​

Disadvantages:

●​ May miss rare but important patterns​
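
A minimal sketch of two non-parametric reductions, simple random sampling and a histogram summary, on synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({"purchase_amount": rng.exponential(scale=50.0, size=100_000)})

# Sampling: keep a 1% simple random sample without replacement
sample = df.sample(frac=0.01, random_state=42)
print(len(sample), "rows kept out of", len(df))

# Histogram: store only bin edges and counts instead of the raw values
counts, edges = np.histogram(df["purchase_amount"], bins=10)
print("bin edges:", np.round(edges, 1))
print("counts:   ", counts)
```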

C. Data Cube Aggregation


Data cube aggregation reduces data by summarizing it at higher granularity.​
For example:

●​ Daily sales → weekly or monthly sales​

●​ City-level data → region-level data​


This reduces the number of rows while retaining overall trends.

Advantages:

●​ Faster OLAP queries​

●​ Useful for dashboards and business analytics​
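
A small pandas sketch of the daily-to-monthly roll-up described above, on synthetic sales data (a datetime index is assumed so that `resample` can be used):

```python
import numpy as np
import pandas as pd

# Synthetic daily sales for one year
days = pd.date_range("2024-01-01", "2024-12-31", freq="D")
rng = np.random.default_rng(3)
daily = pd.DataFrame({"sales": rng.integers(100, 500, size=len(days))}, index=days)

# Roll up: 366 daily rows become 12 monthly rows, trends are preserved
monthly = daily.resample("MS").sum()
print(len(daily), "daily rows ->", len(monthly), "monthly rows")
print(monthly.head(3))
```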

D. Data Compression
Data compression reduces storage requirements with or without losing information.

1. Lossless Compression

●​ No information is lost​

●​ Original data can be perfectly reconstructed​


Examples: Huffman coding, run-length encoding (RLE)​
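
A tiny sketch of run-length encoding, one of the lossless schemes mentioned above: each run of identical values is stored once with a count, and the original sequence is reconstructed exactly:

```python
from itertools import groupby

def rle_encode(values):
    """Store each run of identical values as a (value, count) pair."""
    return [(v, len(list(run))) for v, run in groupby(values)]

def rle_decode(pairs):
    """Rebuild the original sequence exactly (lossless)."""
    return [v for v, count in pairs for _ in range(count)]

data = ["A", "A", "A", "B", "B", "C", "C", "C", "C", "A"]
encoded = rle_encode(data)
print(encoded)                       # [('A', 3), ('B', 2), ('C', 4), ('A', 1)]
assert rle_decode(encoded) == data   # perfect reconstruction
```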

2. Lossy Compression

●​ Some information is lost​

●​ Acceptable when exact accuracy is not needed​


Examples: JPEG, audio compression, dimensionality reduction techniques​

Lossy methods provide very high compression but may slightly reduce accuracy.

6. Data Transformation
Data transformation is a key step in data preprocessing. It involves converting data into a
suitable format or structure that makes mining, analysis, and modeling more effective. Since raw
data may exist in different scales, formats, or levels of detail, transformation helps standardize
and prepare data for algorithms such as classification, clustering, or association rule mining.

The main goal of data transformation is to improve the quality of data representation,
enhance accuracy, and reduce complexity without altering the essential meaning.
What is Data Transformation?
Data transformation is a process that maps the entire set of attribute values to new values
through mathematical, logical, or statistical operations.​
Each original value can be linked back to its transformed value.

Transformation improves:

●​ Data quality​

●​ Uniformity​

●​ Scalability​

●​ Interpretability​

●​ Mining efficiency​

Major Methods of Data Transformation

1. Smoothing
Smoothing techniques are used to remove noise (random errors or fluctuations) and make data
more consistent.

Methods of Smoothing:

●​ Binning:​
Sort data and divide into bins; smooth using mean, median, or boundaries.​

●​ Regression smoothing:​
Fit the data into a regression function (linear or nonlinear).​

●​ Moving averages:​
Use average of recent values to reduce fluctuations.

Smoothing is commonly used in time-series data, sensor data, and noisy attributes.
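
A minimal sketch of moving-average smoothing with pandas `rolling` on a synthetic noisy series; the 7-point window is an arbitrary illustrative choice:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
# Noisy daily sensor readings around a slow upward trend
readings = pd.Series(np.linspace(20, 25, 60) + rng.normal(scale=1.5, size=60))

# 7-point moving average; center=True keeps the smoothed curve aligned with the raw data
smoothed = readings.rolling(window=7, center=True).mean()

print(pd.DataFrame({"raw": readings, "smoothed": smoothed}).round(2).head(10))
```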

2. Attribute/Feature Construction
This involves creating new attributes from existing ones to make patterns easier to discover.​
Constructed features often capture important relationships or domain knowledge better.

Examples:

●​ From birthdate, compute age.​

●​ Combine height and weight to create BMI.​

●​ From transaction timestamps, extract day of week or hour.​

Benefits:

●​ Improves model accuracy​

●​ Makes hidden patterns visible​

●​ Simplifies complex relationships​

This is also known as feature generation.
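
A small pandas sketch of the constructions listed above (age, BMI, day of week); all column names and dates are invented for the example:

```python
import pandas as pd

df = pd.DataFrame({
    "birthdate": pd.to_datetime(["1990-05-14", "2001-11-02"]),
    "height_m":  [1.75, 1.62],
    "weight_kg": [72.0, 55.0],
    "txn_time":  pd.to_datetime(["2024-03-08 14:30", "2024-03-09 09:15"]),
})

today = pd.Timestamp("2024-03-10")
df["age"] = (today - df["birthdate"]).dt.days // 365     # rough age in years
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2        # body mass index
df["txn_day_of_week"] = df["txn_time"].dt.day_name()     # e.g. 'Friday'
df["txn_hour"] = df["txn_time"].dt.hour
print(df[["age", "bmi", "txn_day_of_week", "txn_hour"]])
```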

3. Aggregation
Aggregation summarizes or combines two or more attributes or tuples to create new,
higher-level information.

Examples:

●​ Daily sales → weekly → monthly sales​

●​ Summarizing customer transactions into total monthly spending​

●​ Using count, sum, average on large datasets​


Aggregation reduces data size (helpful for data reduction) and reveals higher-level trends used
in OLAP and data cube construction.

4. Normalization
Normalization is the process of scaling numerical data into a small, standard range, commonly
useful for algorithms that are distance-based (e.g., KNN, K-means) or gradient-based.

Types of Normalization:

(a) Min–Max Normalization

Maps values to a range [0,1] or any fixed range.

$v' = \frac{v - \min}{\max - \min}$

(b) Z-Score Normalization (Standardization)

Maps data to mean 0 and standard deviation 1.

$v' = \frac{v - \mu}{\sigma}$

(c) Decimal Scaling Normalization

Moves decimal point based on maximum absolute value.

$v' = \frac{v}{10^j}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$

Normalization improves algorithm performance by reducing the impact of varying scales.
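
The three formulas above, applied directly to a small pandas Series of invented values (scikit-learn's MinMaxScaler and StandardScaler would give the same results for (a) and (b)):

```python
import numpy as np
import pandas as pd

v = pd.Series([200.0, 300.0, 400.0, 600.0, 986.0])

# (a) Min-max normalization to [0, 1]
min_max = (v - v.min()) / (v.max() - v.min())

# (b) Z-score normalization: mean 0, standard deviation 1
z_score = (v - v.mean()) / v.std(ddof=0)

# (c) Decimal scaling: divide by 10^j so that all |values| fall below 1
j = int(np.floor(np.log10(v.abs().max()))) + 1
decimal_scaled = v / (10 ** j)

print(pd.DataFrame({"v": v, "min_max": min_max,
                    "z_score": z_score, "decimal": decimal_scaled}).round(3))
```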

5. Discretization
Discretization converts continuous numeric attributes into categorical or interval-based values
(bins).​
It reduces the number of attribute values and simplifies analysis.

Examples:

●​ Age → {young, adult, senior}​


●​ Income → {low, medium, high}​

Methods:

●​ Binning​

●​ Entropy-based splitting​

●​ Equal-width or equal-frequency partitioning​

Discretization is also the first step toward concept hierarchy generation.
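
A minimal pandas sketch of discretization by equal-width, domain-defined, and equal-frequency bins; the age boundaries and labels are illustrative:

```python
import pandas as pd

ages = pd.Series([15, 22, 29, 34, 41, 56, 63, 70, 78])

# Equal-width binning into 3 intervals
equal_width = pd.cut(ages, bins=3)

# Domain-defined bins with category labels
labeled = pd.cut(ages, bins=[0, 25, 60, 120], labels=["young", "adult", "senior"])

# Equal-frequency binning into 3 bins
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(pd.DataFrame({"age": ages, "equal_width": equal_width,
                    "labeled": labeled, "equal_freq": equal_freq}))
```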

6. Concept Hierarchy Generation


Concept hierarchies allow data to be viewed at different levels of abstraction.​
It generalizes low-level (detailed) data into high-level (summarized) concepts.

Examples:

Location Hierarchy:

City → State → Country → Region

Product Hierarchy:

Item → Category → Department → All Products

Time Hierarchy:

Day → Month → Quarter → Year

Concept hierarchies enrich OLAP operations such as roll-up and drill-down.
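
A small sketch of applying a location hierarchy stored as a lookup table, rolling city-level sales up to the country level; the cities, sales figures, and mapping are invented for illustration:

```python
import pandas as pd

sales = pd.DataFrame({
    "city":  ["Pune", "Mumbai", "Delhi", "New York", "Chicago"],
    "sales": [120, 340, 210, 500, 310],
})

# Concept hierarchy: city -> state -> country (stored as a lookup table)
hierarchy = pd.DataFrame({
    "city":    ["Pune", "Mumbai", "Delhi", "New York", "Chicago"],
    "state":   ["Maharashtra", "Maharashtra", "Delhi", "New York", "Illinois"],
    "country": ["India", "India", "India", "USA", "USA"],
})

# Roll-up: generalize city-level sales to the country level
rolled_up = (sales.merge(hierarchy, on="city")
                  .groupby("country", as_index=False)["sales"].sum())
print(rolled_up)
```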

Why Data Transformation Is Important


●​ Helps machine learning algorithms understand data better​
●​ Reduces complexity and dimensionality​

●​ Removes noise and makes patterns clearer​

●​ Ensures uniform scaling of attributes​

●​ Enables better performance for distance-based models​

●​ Provides meaningful summaries for business decisions​

Without transformation, many algorithms would perform poorly or produce misleading results.
