Data cleaning using SQL
Removing Duplicates
Explanation
SELECT DISTINCT * FROM table_name;
Example
SELECT DISTINCT customer_id, order_date
FROM orders;
Consideration
Check if all columns are needed for uniqueness.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Filtering Rows
Explanation
SELECT * FROM table_name WHERE
condition;
Example
SELECT * FROM sales
WHERE sales_amount > 0;
Consideration
Define clear conditions for filtering data.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Replacing Null Values
Explanation
UPDATE table_name SET column_name =
value WHERE column_name IS NULL;
Example
UPDATE customers
SET region = 'Unknown'
WHERE region IS NULL;
Consideration
Consider default values that make sense contextually.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Converting Data Types
Explanation
CAST(column_name AS data_type)
Example
SELECT CAST(price AS INT)
FROM products;
Consideration
Ensure compatibility of data types in conversions.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Trimming Strings
Explanation
TRIM(column_name)
Example
SELECT TRIM(username)
FROM users;
Consideration
Remove unwanted spaces to standardize text data.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Aggregating Data
Explanation
GROUP BY column_name
Example
SELECT COUNT(*), city
FROM customers
GROUP BY city;
Consideration
Useful for summarizing or grouping data for analysis.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Sorting Data
Explanation
ORDER BY column_name
Example
SELECT * FROM employees
ORDER BY hire_date DESC;
Consideration
Sort data for better readability or analysis.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Joining Tables
Explanation
JOIN other_table ON table_name.id =
other_table.foreign_key
Example
SELECT a.*, b.sales
FROM clients a
JOIN sales b ON a.id = b.client_id;
Consideration
Ensure joins are on indexed columns for performance.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Regular Expressions
Explanation
REGEXP_LIKE(column_name, pattern)
Example
SELECT *
FROM logs
WHERE REGEXP_LIKE(error_msg, '^Error.*');
Consideration
For pattern matching in text analysis.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Data Deduplication
Explanation
Use row_number and partitions to
identify duplicates
Example
SELECT *,
ROW_NUMBER() OVER (PARTITION BY column_name
ORDER BY id) as rn
FROM table_name WHERE rn = 1;
Consideration
Important for maintaining a single source of truth.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Handling Date Anomalies
Explanation
Use date functions like DATE_PART
Example
UPDATE records
SET sale_date = '2020-01-01'
WHERE DATE_PART('year', sale_date) < 2000;
Consideration
Useful for correcting out-of-range or future dates.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Normalizing Data
Explanation
Use functions to standardize formats
Example
UPDATE products
SET price = ROUND(price, 2);
Consideration
Ensures data uniformity for numerical fields.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Anonymizing Data
Explanation
Replace sensitive data with generic
values
Example
UPDATE users
SET email = CONCAT('user', id, '@example.com');
Consideration
Important for privacy and compliance issues.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Imputing Missing Values
Explanation
Use statistical methods (mean, median)
within a window or group
Example
UPDATE sales
SET volume = COALESCE(volume, (SELECT AVG(volume)
FROM sales)) WHERE volume IS NULL;
Consideration
Choose appropriate method based on data distribution.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Encoding Categorical Data
Explanation
Convert categories to numeric codes
Example
UPDATE surveys
SET device_type = CASE
WHEN device_type = 'mobile' THEN 1 ELSE 2 END;
Consideration
Facilitates analytical models that require numeric inputs.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Parsing JSON or XML
Explanation
Extract elements from structured text
columns
Example
SELECT id, JSON_VALUE(data, '$.key') AS key_value
FROM logs;
Consideration
Useful for extracting information from semi-structured
data.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Removing or Replacing Characters
Explanation
Use REPLACE or TRANSLATE function
Example
UPDATE comments
SET comment = REPLACE(comment,
'unwanted_phrase', '');
Consideration
Cleans up unwanted or sensitive text from strings.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Temporal Consistency Checks
Explanation
Validate sequences and intervals
between dates
Example
SELECT * FROM orders
WHERE order_date < previous_order_date;
Consideration
Ensures logical ordering and intervals in time series data.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Hierarchical Data Management
Explanation
Manage and validate data hierarchies
Example
WITH RECURSIVE subordinates
AS (SELECT id
FROM employees WHERE manager_id = 'XYZ')
SELECT * FROM subordinates;
Consideration
Essential for managing nested or hierarchical relationships.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Data Type Validation
Explanation
Check if data conforms to expected
types
Example
SELECT * FROM transactions
WHERE NOT ISNUMERIC(amount);
Consideration
Essential to ensure data types align with expectations.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Logical Consistency Check
Explanation
Ensure data logically aligns across
related columns
Example
SELECT * FROM employees
WHERE end_date < start_date;
Consideration
Checks the logical sequence or relations in data.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Compound Key Uniqueness
Explanation
Ensure uniqueness across combinations
of columns
Example
SELECT col1, col2, COUNT(*)
FROM table GROUP BY col1, col2 HAVING
COUNT(*) > 1;
Consideration
Useful for identifying duplicate records in multi-column
scenarios.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Advanced Pattern Matching
Explanation
Use advanced regex patterns
Example
SELECT * FROM users
WHERE postal_code NOT
REGEXP_LIKE(postal_code, '^[0-9]{5}$');
Consideration
Validate formats such as postal codes, phone numbers.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Multilingual Text Normalization
Explanation
Standardize text across different
languages
Example
UPDATE reviews SET comment =
CONVERT(comment USING utf8mb4) WHERE
lang = 'JP';
Consideration
Ensures uniform text representation for global data.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Deduplication Using Hashing
Explanation
Use hashing for deduplication and
integrity checking
Example
SELECT HASHBYTES('SHA1', CONCAT(col1,
col2, col3)) AS rowhash, COUNT(*) FROM
table GROUP BY rowhash HAVING
COUNT(*) > 1;
Consideration
Identifies exact duplicates even in large datasets.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Cross-Dataset Validation
Explanation
Ensure consistency between different
datasets
Example
SELECT a.id FROM datasetA a LEFT JOIN
datasetB b ON a.id = b.id WHERE b.id IS
NULL;
Consideration
Checks for missing or inconsistent data across datasets.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Conditional Data Replacement
Explanation
Replace data based on specific
conditions
Example
UPDATE products SET status =
'discontinued' WHERE launch_date <
'2000-01-01';
Consideration
Contextually updates or corrects field values.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Version Control of Data
Explanation
Manage historical data versions
Example
SELECT * FROM employee_salaries FOR
SYSTEM_TIME AS OF '2020-01-01'
Consideration
Tracks changes over time for auditing or historical analysis.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data cleaning using SQL
Cleaning Geographic Coordinates
Explanation
Validate and correct geographic data
Example
UPDATE properties SET lat = NULL, lon =
NULL WHERE lat NOT BETWEEN -90 AND 90
OR lon NOT BETWEEN -180 AND 180;
Consideration
Ensures geographic coordinates are within valid ranges.
Shwetank Singh
GritSetGrow - GSGLearn.com