Data Quality and Data Preprocessing
Data preprocessing is one of the most important steps in data mining and machine learning.
Raw data obtained from real-world sources is often incomplete, noisy, inconsistent,
outdated, or difficult to interpret. If such poor-quality data is used directly, the mining results
can be incorrect or misleading. Hence, preprocessing improves data quality and prepares the data
for further analysis.
1. Data Quality: Why Preprocess the Data?
To perform meaningful analysis, we must ensure that the data satisfies several data quality
dimensions. Poor-quality data leads to wrong decisions.
Measures of Data Quality (A Multidimensional View)
1. Accuracy
● Data should be correct, valid, and free from errors.
● Example: Salary = –10 is inaccurate.
2. Completeness
● All required attributes must be recorded.
● Missing values, unrecorded fields reduce completeness.
3. Consistency
● Values should not conflict across the database.
● Example: Age = 42 but Birthdate = 2010 is inconsistent.
4. Timeliness
● Data should be updated at the right time.
● Outdated data reduces reliability.
5. Believability
● Trustworthiness of data sources.
● Data should not contain fake or intentionally incorrect values.
6. Interpretability
● Data should be understandable and properly documented (metadata, units, codes).
2. Major Tasks in Data Preprocessing
Data preprocessing includes several essential steps that convert raw data into a clean,
integrated, and well-structured form.
A. Data Cleaning
This step handles missing data, noisy data, and inconsistent data. It ensures correctness
and completeness.
Main tasks include:
● Filling missing values
● Smoothing noisy data
● Identifying or removing outliers
● Resolving inconsistencies
B. Data Integration
Combining data from multiple sources such as relational databases, files, spreadsheets, or data
cubes. Integration helps remove redundancy and ensures a unified view of the data.
C. Data Reduction
It reduces the volume of data while preserving its analytical integrity.
Includes:
● Dimensionality reduction (feature selection and extraction)
● Numerosity reduction (histograms, sampling, clustering)
● Data compression (lossy/lossless)
D. Data Transformation and Discretization
Converting data into appropriate formats for mining.
Includes:
● Normalization
● Attribute construction
● Aggregation
● Discretization and concept hierarchy generation
3. Data Cleaning (Detailed)
Data cleaning is necessary because real-world data is rarely perfect. Problems arise due to
faulty instruments, typing mistakes, transmission errors, misunderstanding by users, or incorrect
data collection techniques.
1. Incomplete (Missing) Data
Reasons for Missing Data
● Equipment malfunction
● Deleted because of inconsistencies
● Certain values not considered important during data entry
● User forgot to record details
● Historical data not updated or maintained
Handling Missing Data
1. Ignore the tuple
Used when the class label is missing (in classification), but ineffective when many
values are missing.
2. Fill manually
Accurate but time-consuming and impractical for large datasets.
3. Fill with a global constant
Example: Replace with “unknown” — simple but may create a new class.
4. Fill with attribute mean
Suitable for numerical attributes but may reduce data variability (see the sketch after this list).
5. Class-wise mean
Better than global mean because it considers class structure.
6. Most probable value
Using statistical or AI techniques such as Bayesian methods, regression, or decision
trees.
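As a rough illustration, the sketch below applies strategies 1, 4, and 5 from the list above using pandas; the tiny dataset and its column names are invented for the example.

```python
import numpy as np
import pandas as pd

# Invented dataset: a class label and a numeric attribute with missing values.
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50_000, np.nan, 42_000, np.nan, 38_000],
})

# Strategy 1: ignore the tuple (drop rows containing missing values).
dropped = df.dropna()

# Strategy 4: fill with the global attribute mean.
global_mean = df.assign(income=df["income"].fillna(df["income"].mean()))

# Strategy 5: class-wise mean (fill each gap with the mean of its own class).
class_mean = df.assign(
    income=df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))
)

print(global_mean)
print(class_mean)
```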
2. Noisy Data
Noise refers to random errors or variations in data values.
Causes include faulty sensors, typing errors, transmission issues, and format inconsistencies.
Handling Noisy Data
1. Binning
○ Sort data
○ Divide into equal-frequency bins
○ Smooth by mean, median, or bin boundaries (illustrated in the sketch after this list)
2. Regression
○ Fit a regression function to data (linear or polynomial)
○ Smooth values based on predicted trends
3. Clustering
○ Group similar values into clusters; values falling outside the clusters (or in very small clusters) are likely outliers
○ These can be removed or treated separately
4. Human–computer inspection
○ Suspicious or extreme values manually checked and corrected
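A minimal sketch of the binning approach (item 1 above) in NumPy; the values, the choice of three bins, and the two smoothing rules are illustrative assumptions.

```python
import numpy as np

# Invented sorted values and an equal-frequency (equi-depth) split into 3 bins.
data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.array_split(data, 3)

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
smoothed_by_mean = np.concatenate([np.full(len(b), b.mean()) for b in bins])

# Smoothing by bin boundaries: each value moves to the closer bin boundary.
smoothed_by_boundary = np.concatenate([
    np.where(b - b.min() <= b.max() - b, b.min(), b.max()) for b in bins
])

print(smoothed_by_mean)      # bin 1 -> 9.0, bin 2 -> 22.75, bin 3 -> 29.25
print(smoothed_by_boundary)  # e.g. first bin becomes [4, 4, 4, 15]
```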
3. Inconsistent Data
Occurs when:
● Naming conventions differ
● Code systems change over time
● Different versions of records exist
● Duplicate records have conflicting values
Data Cleaning as a Process
● Data discrepancy detection: Using metadata such as domain, ranges, rules
● Field checks: Uniqueness rule, null rule, format rule
● Data scrubbing: Spell-check, postal code validation
● Data auditing: Discover hidden rules using correlation or clustering
● Data migration using ETL tools: For extraction, transformation, loading
● Interactive cleaning tools: Like Potter’s Wheel for iterative correction
4. Data Integration
Data integration is a crucial step in data preprocessing that involves combining data from
multiple sources—such as databases, data warehouses, flat files, sensors, and web logs—into
a single coherent, unified dataset.
Since modern organizations store data across different platforms and formats, integration
ensures that all information is brought together correctly so meaningful analysis and data mining
can be performed.
The goal of data integration is to provide a consistent, accurate, non-redundant view of data
across the entire organization.
However, integration is challenging because data from different sources may have differences in
naming, formats, structures, measurement units, and levels of detail.
Main Issues in Data Integration
1. Schema Integration
Schema integration deals with merging the schemas (structures, tables, fields) of different
databases.
Problems
● Same attribute may have different names in different sources.
Example:
○ A.cust-id
○ B.customer-number
○ C.cid
● Attributes may have different data types (e.g., integer vs string).
● Same concept may be represented differently (e.g., gender: M/F vs 0/1).
Goal
To create a unified schema without ambiguity or naming conflicts.
Techniques used include:
● Schema matching
● Mapping rules
● Standard naming conventions
2. Entity Identification Problem
Entity identification refers to determining whether two records from different sources represent
the same real-world entity.
Examples
● “Bill Clinton”, “William Clinton”, “W. J. Clinton”
● Customer ID in one file vs Email ID in another
● Phone number formats: (555)123-4567 vs 5551234567
Causes
● Different naming conventions
● Abbreviations
● Typographical differences
● Missing unique identifiers
Techniques used:
● Matching algorithms
● String similarity measures (Levenshtein distance)
● Domain knowledge
● Standardization rules
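As an illustration of string-similarity matching, the sketch below implements the standard dynamic-programming Levenshtein distance in plain Python; the names being compared are taken from the example above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Records whose normalized names are "close" may refer to the same entity.
print(levenshtein("William Clinton", "Bill Clinton"))   # small distance
print(levenshtein("William Clinton", "George Bush"))    # much larger distance
```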
3. Detecting and Resolving Data Value Conflicts
Data value conflicts occur when the same attribute has different values in different systems.
Types of Conflicts
a) Different Measurement Units
● Weight stored as kg in one system, lbs in another
● Distance recorded as miles in one source, kilometers in another
b) Different Scales or Ranges
● Temperature in Celsius vs Fahrenheit
● Currency stored in USD vs INR
c) Different Coding Schemes
● Male/Female vs M/F vs 1/0
● Department codes: HR vs Human Resources
d) Different Levels of Granularity
● Monthly sales vs yearly sales
● City vs region level information
Resolving these conflicts requires:
● Standardization
● Unit conversion
● Common coding systems
● Data transformation rules
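A minimal sketch of conflict resolution through unit conversion and a common coding scheme; the record layout and the mapping of 1/0 to M/F are illustrative assumptions.

```python
# Convert weights to a common unit (kg) and gender codes to a common scheme.
LB_PER_KG = 2.20462

def to_kg(value: float, unit: str) -> float:
    """Normalize a weight to kilograms."""
    return value / LB_PER_KG if unit.lower() in ("lb", "lbs") else value

# Assumed code table: which raw codes map to M and which to F is illustrative.
GENDER_CODES = {"m": "M", "male": "M", "1": "M",
                "f": "F", "female": "F", "0": "F"}

records = [
    {"weight": 154.0, "unit": "lbs", "gender": "male"},
    {"weight": 70.0,  "unit": "kg",  "gender": "0"},
]

for r in records:
    r["weight_kg"] = round(to_kg(r["weight"], r["unit"]), 1)
    r["gender"] = GENDER_CODES[r["gender"].lower()]

print(records)
```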
4. Handling Redundancy
When integrating multiple databases, redundancy is very common.
Types of Redundancy
a) Duplicate Attributes
Same information stored under different names in multiple tables.
Example:
● cust_id
● customer_id
● id
These need to be merged into one unique attribute.
b) Derivable Attributes
One attribute is computable from another.
Example:
● Annual revenue can be derived from monthly revenue
● Age can be calculated from birthdate
If such attributes are retained without checking, they cause:
● Inconsistency
● Extra storage
● Slower processing
Thus, redundancy must be detected and removed.
Tools for Redundancy Detection
Data integration uses statistical tools like:
1. Correlation Analysis
Checks if two attributes are strongly related.
High correlation often indicates redundancy (see the sketch after this list).
2. Covariance Analysis
Measures how two attributes vary together.
A high covariance may suggest duplication or dependency.
3. Metadata Analysis
Uses data descriptions, units, types, and constraints to detect redundant or overlapping data.
4. Domain Knowledge
Human experts help understand which attributes are essential and which can be removed.
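A minimal sketch of correlation-based redundancy detection (tool 1 above) using pandas; the synthetic dataset and the 0.9 threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
monthly = rng.uniform(10, 100, size=200)
df = pd.DataFrame({
    "monthly_revenue": monthly,
    "annual_revenue": monthly * 12,             # derivable, hence redundant
    "num_employees": rng.integers(1, 50, 200),
})

corr = df.corr().abs()
threshold = 0.9
# Flag attribute pairs whose absolute correlation exceeds the threshold.
redundant = [(a, b) for a in corr.columns for b in corr.columns
             if a < b and corr.loc[a, b] > threshold]
print(redundant)   # [('annual_revenue', 'monthly_revenue')]
```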
Importance of Proper Data Integration
● Avoids duplicates and inconsistencies
● Ensures reliable and trusted data
● Reduces storage and speeds up mining
● Produces meaningful and holistic insights
● Essential for building integrated data warehouses and OLAP systems
5. Data Reduction
Data reduction is an important step in data preprocessing that aims to obtain a smaller, more
compact representation of the dataset without losing significant information.
In real-world applications, data warehouses often contain gigabytes or terabytes of data.
Running complex data mining algorithms on such huge datasets becomes slow, costly, and
inefficient.
Data reduction techniques make analysis practical and faster by reducing volume while
preserving the essential information.
Why Data Reduction?
Data reduction is necessary because:
1. Faster Data Mining
Algorithms run significantly faster on reduced datasets.
2. Less Storage
Compact representations save disk space and memory.
3. Improved Algorithm Performance
Many mining methods (clustering, classification, neural networks) work better when the dataset
is smaller and more manageable.
4. Easier Visualization
High-dimensional data is difficult to visualize; reduction provides clearer insights.
A. Dimensionality Reduction
Dimensionality reduction deals with reducing the number of attributes or features.
In many machine learning tasks, datasets have dozens or hundreds of features, many of which
may be irrelevant, redundant, or noisy.
Problems Caused by Too Many Features
● Difficult to visualize or interpret
● Computation becomes expensive
● Leads to the “curse of dimensionality”
● Many features may be correlated or useless
Dimensionality reduction techniques help simplify the model.
1. Feature Selection
Feature selection chooses a subset of the original features that are most relevant.
Instead of transforming features, it eliminates unnecessary ones.
Methods of Feature Selection
a) Filter Methods
● Use statistical tests such as correlation, chi-square, mutual information.
● Independent of learning algorithms.
b) Wrapper Methods
● Evaluate subsets by running a model (e.g., accuracy-based selection).
● More accurate but computationally expensive.
c) Embedded Methods
● Feature selection happens inside the model.
● Example: Decision trees, LASSO regression.
Feature selection reduces dimensionality without transforming the data, making it easier to
understand.
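A minimal sketch of a filter-style feature selection: rank features by their absolute correlation with the target and keep the top k. The synthetic data, column names, and k are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "hours_studied": rng.uniform(0, 10, n),
    "noise_feature": rng.normal(size=n),        # unrelated to the target
    "attendance":    rng.uniform(0, 10, n),
})
target = 3 * df["hours_studied"] + 2 * df["attendance"] + rng.normal(size=n)

# Filter method: score each feature independently of any learning algorithm.
k = 2
scores = df.corrwith(target).abs().sort_values(ascending=False)
selected = scores.head(k).index.tolist()
print(scores)
print("Selected features:", selected)   # likely ['hours_studied', 'attendance']
```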
2. Feature Extraction
Unlike feature selection, feature extraction transforms the original attributes into a new, smaller
set of features.
Popular techniques:
a) PCA (Principal Component Analysis)
● Creates new orthogonal components.
● Captures maximum variance using fewer dimensions.
● Very effective when many features are correlated.
Feature extraction changes the data representation but preserves important patterns.
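A minimal sketch of PCA-based feature extraction using scikit-learn (one common implementation); the synthetic data and the choice of two components are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
x = rng.normal(size=(200, 1))
# Five correlated features built mostly from one underlying factor.
X = np.hstack([x, 2 * x + 0.1 * rng.normal(size=(200, 1)),
               -x, rng.normal(size=(200, 1)), 0.5 * x])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # 200 x 2 instead of 200 x 5
print(X_reduced.shape)
print(pca.explained_variance_ratio_)    # most variance in the first component
```

The explained variance ratio shows how much of the original variance each component retains, which guides how many components to keep.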
Advantages of Dimensionality Reduction
1. Requires less storage
2. Faster computation and model training
3. Helps remove redundant and irrelevant features
4. Reduces noise and overfitting
5. Improves visualization
Disadvantages
1. Possible loss of information
2. PCA assumes linear relationships among variables
3. Some techniques are not easily interpretable
4. Difficult to choose the correct number of components
B. Numerosity Reduction
Numerosity reduction reduces data volume by replacing the original data with a smaller, alternative representation.
Instead of keeping all the data, we store a compact model or summary of the data.
1. Parametric Methods
These methods assume that data follows a particular model and store only model parameters,
not the actual data.
Examples:
● Regression Models
Fit data using linear or nonlinear regression; only coefficients are stored.
● Log-Linear Models
Useful for multidimensional data; store parameters instead of full data cubes.
Advantages:
● Highly compressed
● Easy to store and compute
Disadvantages:
● Accuracy depends on model fit
2. Non-Parametric Methods
Do not assume any predefined model.
Examples:
● Histograms
Group data into bins and store only the bin boundaries and per-bin counts (see the sketch after this list).
● Clustering
Represent data by cluster centers; outliers removed or grouped separately.
● Sampling
Select a representative sample instead of the full dataset.
Advantages:
● Flexible and simple
● Effective for large datasets
Disadvantages:
● May miss rare but important patterns
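A minimal sketch of two non-parametric reductions, an equal-width histogram and simple random sampling, using NumPy; the data, bin count, and sample fraction are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)
values = rng.normal(loc=50, scale=10, size=100_000)

# Histogram: keep 20 bin boundaries plus counts instead of 100,000 raw values.
counts, edges = np.histogram(values, bins=20)

# Sampling: keep a 1% simple random sample without replacement.
sample = rng.choice(values, size=len(values) // 100, replace=False)

print(counts.sum(), len(edges))   # 100000 21
print(len(sample))                # 1000
```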
C. Data Cube Aggregation
Data cube aggregation reduces data by summarizing it at higher granularity.
For example:
● Daily sales → weekly or monthly sales
● City-level data → region-level data
This reduces the number of rows while retaining overall trends.
Advantages:
● Faster OLAP queries
● Useful for dashboards and business analytics
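A minimal sketch of cube-style aggregation with pandas, rolling daily city-level sales up to monthly region-level totals; all names and values are invented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
days = pd.date_range("2024-01-01", periods=90, freq="D")
sales = pd.DataFrame({
    "date":   np.tile(days, 2),
    "city":   ["Pune"] * 90 + ["Delhi"] * 90,
    "amount": rng.integers(100, 1000, size=180),
})
sales["region"] = sales["city"].map({"Pune": "West", "Delhi": "North"})

# Roll up: daily, city-level rows -> monthly, region-level totals.
monthly_region = (
    sales.groupby(["region", pd.Grouper(key="date", freq="MS")])["amount"]
         .sum()
         .reset_index()
)
print(monthly_region)   # 6 summary rows instead of 180 detail rows
```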
D. Data Compression
Data compression reduces storage requirements with or without losing information.
1. Lossless Compression
● No information is lost
● Original data can be perfectly reconstructed
Examples: Huffman coding, run-length encoding (RLE)
2. Lossy Compression
● Some information is lost
● Acceptable when exact accuracy is not needed
Examples: JPEG, audio compression, dimensionality reduction techniques
Lossy methods provide very high compression but may slightly reduce accuracy.
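A minimal sketch of lossless compression using run-length encoding (RLE) in plain Python, showing that the original sequence is reconstructed exactly.

```python
from itertools import groupby

def rle_encode(seq):
    """Store (value, run length) pairs instead of every repeated value."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(seq)]

def rle_decode(pairs):
    """Rebuild the original sequence from the (value, count) pairs."""
    return [value for value, count in pairs for _ in range(count)]

data = "AAAABBBCCDAAA"
encoded = rle_encode(data)
print(encoded)   # [('A', 4), ('B', 3), ('C', 2), ('D', 1), ('A', 3)]
assert "".join(rle_decode(encoded)) == data   # perfectly reconstructed
```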
6. Data Transformation
Data transformation is a key step in data preprocessing. It involves converting data into a
suitable format or structure that makes mining, analysis, and modeling more effective. Since raw
data may exist in different scales, formats, or levels of detail, transformation helps standardize
and prepare data for algorithms such as classification, clustering, or association rule mining.
The main goal of data transformation is to improve the quality of data representation,
enhance accuracy, and reduce complexity without altering the essential meaning.
What is Data Transformation?
Data transformation is a process that maps the entire set of values of an attribute to a new set of
values through mathematical, logical, or statistical operations, so that each original value can be
identified with one of the new values.
Transformation improves:
● Data quality
● Uniformity
● Scalability
● Interpretability
● Mining efficiency
Major Methods of Data Transformation
1. Smoothing
Smoothing techniques are used to remove noise (random errors or fluctuations) and make data
more consistent.
Methods of Smoothing:
● Binning:
Sort data and divide into bins; smooth using mean, median, or boundaries.
● Regression smoothing:
Fit the data into a regression function (linear or nonlinear).
● Moving averages:
Use average of recent values to reduce fluctuations.
Smoothing is commonly used in time-series data, sensor data, and noisy attributes.
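A minimal sketch of moving-average smoothing on a noisy series using pandas; the synthetic series and the window size of 7 are illustrative choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
t = np.arange(100)
# A smooth trend plus random noise.
noisy = pd.Series(np.sin(t / 10) + rng.normal(scale=0.3, size=100))

# Centered 7-point moving average; endpoints with incomplete windows become NaN.
smoothed = noisy.rolling(window=7, center=True).mean()

print(noisy.head(10).round(2).tolist())
print(smoothed.head(10).round(2).tolist())
```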
2. Attribute/Feature Construction
This involves creating new attributes from existing ones to make patterns easier to discover.
Constructed features often capture important relationships or domain knowledge better.
Examples:
● From birthdate, compute age.
● Combine height and weight to create BMI.
● From transaction timestamps, extract day of week or hour.
Benefits:
● Improves model accuracy
● Makes hidden patterns visible
● Simplifies complex relationships
This is also known as feature generation.
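A minimal sketch of attribute construction with pandas, deriving age, BMI, and day of week from existing attributes; the column names and values are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "birthdate": pd.to_datetime(["1990-05-01", "2001-11-23"]),
    "height_m":  [1.75, 1.62],
    "weight_kg": [70.0, 55.0],
    "txn_time":  pd.to_datetime(["2024-03-15 09:30", "2024-03-16 18:05"]),
})

today = pd.Timestamp("2024-06-01")                      # fixed date for reproducibility
df["age"] = (today - df["birthdate"]).dt.days // 365    # approximate age in years
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
df["txn_day_of_week"] = df["txn_time"].dt.day_name()

print(df[["age", "bmi", "txn_day_of_week"]])
```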
3. Aggregation
Aggregation summarizes or combines two or more attributes or tuples to create new,
higher-level information.
Examples:
● Daily sales → weekly → monthly sales
● Summarizing customer transactions into total monthly spending
● Using count, sum, average on large datasets
Aggregation reduces data size (helpful for data reduction) and reveals higher-level trends used
in OLAP and data cube construction.
4. Normalization
Normalization is the process of scaling numerical data into a small, standard range, commonly
useful for algorithms that are distance-based (e.g., KNN, K-means) or gradient-based.
Types of Normalization:
(a) Min–Max Normalization
Maps values to a range [0,1] or any fixed range.
v' = (v − min) / (max − min)
(b) Z-Score Normalization (Standardization)
Maps data to mean 0 and standard deviation 1.
v' = (v − μ) / σ
(c) Decimal Scaling Normalization
Moves decimal point based on maximum absolute value.
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
Normalization improves algorithm performance by reducing the impact of varying scales.
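A minimal sketch of the three normalization formulas above in NumPy; the sample values are purely illustrative.

```python
import numpy as np

v = np.array([-986.0, 200.0, 300.0, 400.0, 917.0])

# (a) Min-max normalization to [0, 1]
min_max = (v - v.min()) / (v.max() - v.min())

# (b) Z-score normalization (mean 0, standard deviation 1)
z_score = (v - v.mean()) / v.std()

# (c) Decimal scaling: smallest j such that max(|v'|) < 1 (here j = 3)
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
decimal_scaled = v / 10 ** j

print(min_max.round(3))
print(z_score.round(3))
print(decimal_scaled)
```

Note that min–max normalization is sensitive to outliers (they define the min and max), whereas z-score normalization is usually preferred when extreme values are present.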
5. Discretization
Discretization converts continuous numeric attributes into categorical or interval-based values
(bins).
It reduces the number of attribute values and simplifies analysis.
Examples:
● Age → {young, adult, senior}
● Income → {low, medium, high}
Methods:
● Binning
● Entropy-based splitting
● Equal-width or equal-frequency partitioning
Discretization is also the first step toward concept hierarchy generation.
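A minimal sketch of discretization with pandas, using explicit cut points for age and equal-frequency (quantile) bins for income; the cut points, labels, and bin counts are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
df = pd.DataFrame({
    "age":    rng.integers(18, 90, size=100),
    "income": rng.normal(50_000, 15_000, size=100).round(),
})

# Interval binning with explicit cut points and category labels.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 60, 120],
                         labels=["young", "adult", "senior"])

# Equal-frequency (quantile) bins: roughly the same number of rows per bin.
df["income_level"] = pd.qcut(df["income"], q=3, labels=["low", "medium", "high"])

print(df["age_group"].value_counts())
print(df["income_level"].value_counts())
```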
6. Concept Hierarchy Generation
Concept hierarchies allow data to be viewed at different levels of abstraction.
They generalize low-level (detailed) data into high-level (summarized) concepts.
Examples:
Location Hierarchy:
City → State → Country → Region
Product Hierarchy:
Item → Category → Department → All Products
Time Hierarchy:
Day → Month → Quarter → Year
Concept hierarchies enrich OLAP operations such as roll-up and drill-down.
Why Data Transformation Is Important
● Helps machine learning algorithms understand data better
● Reduces complexity and dimensionality
● Removes noise and makes patterns clearer
● Ensures uniform scaling of attributes
● Enables better performance for distance-based models
● Provides meaningful summaries for business decisions
Without transformation, many algorithms would perform poorly or produce misleading results.