0% found this document useful (0 votes)
3 views29 pages

Data Cleaning Using SQL 1714249253

The document provides a comprehensive guide on data cleaning techniques using SQL, covering various methods such as removing duplicates, filtering rows, replacing null values, and converting data types. Each section includes explanations, examples, and considerations for best practices. It emphasizes the importance of data integrity, consistency, and proper formatting in data management.

Uploaded by

arkapro.das.1998
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
3 views29 pages

Data Cleaning Using SQL 1714249253

The document provides a comprehensive guide on data cleaning techniques using SQL, covering various methods such as removing duplicates, filtering rows, replacing null values, and converting data types. Each section includes explanations, examples, and considerations for best practices. It emphasizes the importance of data integrity, consistency, and proper formatting in data management.

Uploaded by

arkapro.das.1998
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Data cleaning using SQL

Removing Duplicates
Explanation
SELECT DISTINCT * FROM table_name;

Example
SELECT DISTINCT customer_id, order_date
FROM orders;

Consideration
Check if all columns are needed for uniqueness.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Filtering Rows
Explanation
SELECT * FROM table_name WHERE
condition;

Example
SELECT * FROM sales
WHERE sales_amount > 0;

Consideration
Define clear conditions for filtering data.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Replacing Null Values
Explanation
UPDATE table_name SET column_name =
value WHERE column_name IS NULL;

Example
UPDATE customers
SET region = 'Unknown'
WHERE region IS NULL;

Consideration
Consider default values that make sense contextually.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Converting Data Types
Explanation
CAST(column_name AS data_type)

Example
SELECT CAST(price AS INT)
FROM products;

Consideration
Ensure compatibility of data types in conversions.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Trimming Strings
Explanation
TRIM(column_name)

Example
SELECT TRIM(username)
FROM users;

Consideration
Remove unwanted spaces to standardize text data.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Aggregating Data
Explanation
GROUP BY column_name

Example
SELECT COUNT(*), city
FROM customers
GROUP BY city;

Consideration
Useful for summarizing or grouping data for analysis.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Sorting Data
Explanation
ORDER BY column_name

Example
SELECT * FROM employees
ORDER BY hire_date DESC;

Consideration
Sort data for better readability or analysis.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Joining Tables
Explanation
JOIN other_table ON table_name.id =
other_table.foreign_key

Example
SELECT a.*, b.sales
FROM clients a
JOIN sales b ON a.id = b.client_id;

Consideration
Ensure joins are on indexed columns for performance.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Regular Expressions
Explanation
REGEXP_LIKE(column_name, pattern)

Example
SELECT *
FROM logs
WHERE REGEXP_LIKE(error_msg, '^Error.*');

Consideration
For pattern matching in text analysis.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Data Deduplication
Explanation
Use row_number and partitions to
identify duplicates

Example
SELECT *,
ROW_NUMBER() OVER (PARTITION BY column_name
ORDER BY id) as rn
FROM table_name WHERE rn = 1;

Consideration
Important for maintaining a single source of truth.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Handling Date Anomalies
Explanation
Use date functions like DATE_PART

Example
UPDATE records
SET sale_date = '2020-01-01'
WHERE DATE_PART('year', sale_date) < 2000;

Consideration
Useful for correcting out-of-range or future dates.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Normalizing Data
Explanation
Use functions to standardize formats

Example
UPDATE products
SET price = ROUND(price, 2);

Consideration
Ensures data uniformity for numerical fields.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Anonymizing Data
Explanation
Replace sensitive data with generic
values

Example
UPDATE users
SET email = CONCAT('user', id, '@example.com');

Consideration
Important for privacy and compliance issues.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Imputing Missing Values
Explanation
Use statistical methods (mean, median)
within a window or group

Example
UPDATE sales
SET volume = COALESCE(volume, (SELECT AVG(volume)
FROM sales)) WHERE volume IS NULL;

Consideration
Choose appropriate method based on data distribution.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Encoding Categorical Data
Explanation
Convert categories to numeric codes

Example
UPDATE surveys
SET device_type = CASE
WHEN device_type = 'mobile' THEN 1 ELSE 2 END;

Consideration
Facilitates analytical models that require numeric inputs.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Parsing JSON or XML
Explanation
Extract elements from structured text
columns

Example
SELECT id, JSON_VALUE(data, '$.key') AS key_value
FROM logs;

Consideration
Useful for extracting information from semi-structured
data.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Removing or Replacing Characters
Explanation
Use REPLACE or TRANSLATE function

Example
UPDATE comments
SET comment = REPLACE(comment,
'unwanted_phrase', '');

Consideration
Cleans up unwanted or sensitive text from strings.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Temporal Consistency Checks
Explanation
Validate sequences and intervals
between dates

Example
SELECT * FROM orders
WHERE order_date < previous_order_date;

Consideration
Ensures logical ordering and intervals in time series data.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Hierarchical Data Management
Explanation
Manage and validate data hierarchies

Example
WITH RECURSIVE subordinates
AS (SELECT id
FROM employees WHERE manager_id = 'XYZ')
SELECT * FROM subordinates;

Consideration
Essential for managing nested or hierarchical relationships.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Data Type Validation
Explanation
Check if data conforms to expected
types

Example
SELECT * FROM transactions
WHERE NOT ISNUMERIC(amount);

Consideration
Essential to ensure data types align with expectations.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Logical Consistency Check
Explanation
Ensure data logically aligns across
related columns

Example
SELECT * FROM employees
WHERE end_date < start_date;

Consideration
Checks the logical sequence or relations in data.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Compound Key Uniqueness
Explanation
Ensure uniqueness across combinations
of columns

Example
SELECT col1, col2, COUNT(*)
FROM table GROUP BY col1, col2 HAVING
COUNT(*) > 1;

Consideration
Useful for identifying duplicate records in multi-column
scenarios.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Advanced Pattern Matching
Explanation
Use advanced regex patterns

Example
SELECT * FROM users
WHERE postal_code NOT
REGEXP_LIKE(postal_code, '^[0-9]{5}$');

Consideration
Validate formats such as postal codes, phone numbers.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Multilingual Text Normalization
Explanation
Standardize text across different
languages

Example
UPDATE reviews SET comment =
CONVERT(comment USING utf8mb4) WHERE
lang = 'JP';

Consideration
Ensures uniform text representation for global data.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Deduplication Using Hashing
Explanation
Use hashing for deduplication and
integrity checking

Example
SELECT HASHBYTES('SHA1', CONCAT(col1,
col2, col3)) AS rowhash, COUNT(*) FROM
table GROUP BY rowhash HAVING
COUNT(*) > 1;
Consideration
Identifies exact duplicates even in large datasets.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Cross-Dataset Validation
Explanation
Ensure consistency between different
datasets

Example
SELECT a.id FROM datasetA a LEFT JOIN
datasetB b ON a.id = b.id WHERE b.id IS
NULL;

Consideration
Checks for missing or inconsistent data across datasets.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Conditional Data Replacement
Explanation
Replace data based on specific
conditions

Example
UPDATE products SET status =
'discontinued' WHERE launch_date <
'2000-01-01';

Consideration
Contextually updates or corrects field values.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Version Control of Data
Explanation
Manage historical data versions

Example
SELECT * FROM employee_salaries FOR
SYSTEM_TIME AS OF '2020-01-01'

Consideration
Tracks changes over time for auditing or historical analysis.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Cleaning Geographic Coordinates
Explanation
Validate and correct geographic data

Example
UPDATE properties SET lat = NULL, lon =
NULL WHERE lat NOT BETWEEN -90 AND 90
OR lon NOT BETWEEN -180 AND 180;

Consideration
Ensures geographic coordinates are within valid ranges.

Shwetank Singh
GritSetGrow - GSGLearn.com

You might also like