0% found this document useful (0 votes)
16 views11 pages

Clean Messy Data Using SQL

Uploaded by

suraj
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
16 views11 pages

Clean Messy Data Using SQL

Uploaded by

suraj
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Jayen Thakker MetricMinds.

in
Data Analytics Mentor Analytics for All

Data Analytics

How To Prepare
Messy Data for
Analysis using SQL?

How To Prepare Messy Data for Analysis using SQL? Swipe next
[Link] Swipe to
Analytics for All
continue reading

You work as a data analyst for a retail company that collects

daily sales data from various stores.

Let’s walk through the process of cleaning a dataset using

SQL with a real-world example.

Below is a sample of the SalesTransactions table:

TransactionID CustomerID TransactionDate ProductID QuantitySold Amount


Total

1001 2001 12/09/2024 P101 5 500

1002 2001 12/09/2024 P102 NULL 1500

1003 2002 11/09/2024 P103 3 450

1003 2002 11/09/2024 P103 3 450

1004 2003 10/09/2024 P104 10000 100000

1005 2004 09/09/2024 XYZ 2 300

Issues in the Data:

There are several issues in this table


Missing values (NULL in QuantitySold)
Duplicate transactions (same TransactionID repeated)
Inconsistent date format (dd/mm/yyyy instead of yyyy-mm-
dd)
Outliers in QuantitySold and TotalAmount (e.g., 10,000 units
sold)
Invalid ProductID (e.g., XYZ is not a valid product).
Let’s clean this data using SQL.

How To Prepare Messy Data for Analysis using SQL? 02


[Link] Swipe to
Analytics for All
continue reading

1. Handling Missing Values

Problem :

The QuantitySold column has a missing value (NULL) for

ProductID = P102. We can estimate this missing value based on

the TotalAmount and the product's price.

First, identify rows where QuantitySold is missing (NULL).

SELECT *

FROM SalesTransactions

WHERE QuantitySold IS NULL;

If a price for the product is available in a Products table, calculate

the missing QuantitySold by dividing TotalAmount by the price.

UPDATE SalesTransactions

SET QuantitySold = TotalAmount / (

SELECT PricePerUnit

FROM Products

WHERE [Link] = [Link]

WHERE QuantitySold IS NULL;

How To Prepare Messy Data for Analysis using SQL? 03


[Link] Swipe to
Analytics for All
continue reading

If you can’t calculate QuantitySold, you may choose to remove

the row:

DELETE FROM SalesTransactions

WHERE QuantitySold IS NULL;

After Handling Missing Values

We used SQL to calculate the missing QuantitySold by dividing

the TotalAmount by the product's price from another table

(Products).

The missing value in QuantitySold is now replaced by the correct

value 10.

TransactionID CustomerID TransactionDate ProductID QuantitySold TotalAmount

1001 2001 12/09/2024 P101 5 500

1002 2001 12/09/2024 P102 10 5


1 00

How To Prepare Messy Data for Analysis using SQL? 04


[Link] Swipe to
Analytics for All
continue reading

Removing Duplicate Records

Problem :

There’s a duplicate record for TransactionID = 1003, where

the exact same transaction is repeated twice.

Duplicate records can lead to inaccurate analysis. We can

remove duplicates by using the ROW_NUMBER() function.

WITH DuplicateCheck AS (

SELECT *, ROW_NUMBER() OVER

(PARTITION BY TransactionID, CustomerID,

TransactionDate

ORDER BY TransactionID) AS RowNum

FROM SalesTransactions

DELETE FROM SalesTransactions

WHERE TransactionID IN (

SELECT TransactionID FROM DuplicateCheck WHERE RowNum

> 1

);

How To Prepare Messy Data for Analysis using SQL? 05


[Link] Swipe to
Analytics for All
continue reading

After Removing Duplicate Records

The duplicate row has been deleted, leaving only one record for

TransactionID = 1003.

TransactionID CustomerID TransactionDate ProductID QuantitySold TotalAmount

1001 2001 12-09-2024 P101 5 500

1002 2001 12-09-2024 P102 10 1500

1003 2002 11-09-2024 P103 3 450

How To Prepare Messy Data for Analysis using SQL? 06


[Link] Swipe to
Analytics for All
continue reading

Standardizing Date Formats

Problem :

The TransactionDate is in different formats like dd/mm/yyyy (e.g.,

12/09/2024), which makes it inconsistent for analysis.

Now, let’s convert TransactionDate from the dd/mm/yyyy format

to yyyy-mm-dd for consistency.

UPDATE SalesTransactions

SET TransactionDate = STR_TO_DATE(TransactionDate, '%d/%m/

%Y')

WHERE TransactionDate LIKE '%/%';

After Standardizing Date Formats

All transaction dates are now in the same format: 2024-09-12,


making it easier for queries, comparisons, and reporting.

TransactionID CustomerID TransactionDate Prod uctID QuantitySold TotalAmount

1001 2001 12-09-2024 P101 5 500

1002 2001 12-09-2024 P102 10 1500

1003 2002 11-09-2024 P103 3 450

How To Prepare Messy Data for Analysis using SQL? 07


[Link] Swipe to
Analytics for All
continue reading

Identify & Handle Outliers

Problem :

There’s an abnormally high QuantitySold of 10,000 units for

TransactionID = 1004, which is likely an error or an extreme outlier.

We can flag or delete such extreme values.

Identify outliers in QuantitySold

SELECT *

FROM SalesTransactions

WHERE QuantitySold > 1000;

Remove the outlier

DELETE FROM SalesTransactions

WHERE QuantitySold > 1000;

After Identify & Handle Outliers

We flagged this as an outlier using a simple rule (e.g., quantities


greater than 1,000), and then removed this transaction from the data.

TransactionID CustomerID TransactionDate c D


Produ tI QuantitySold TotalAmount

1004 2003 Deleted

How To Prepare Messy Data for Analysis using SQL? 08


[Link] Swipe to
Analytics for All
continue reading

Validating Product IDs

Problem :

The ProductID for TransactionID = 1005 is XYZ, which is invalid

because there is no such product in the Products table.

Identify invalid ProductIDs

SELECT *

FROM SalesTransactions

WHERE ProductID NOT IN (SELECT ProductID FROM Products);

Remove rows with invalid ProductIDs

DELETE FROM SalesTransactions

WHERE ProductID NOT IN (SELECT ProductID FROM Products);

After Identify & Handle Outliers

The row with the invalid ProductID = XYZ has been deleted, leaving
only valid product entries.

TransactionID CustomerID TransactionDate ProductID QuantitySold TotalAmount

1005 2004 Deleted

How To Prepare Messy Data for Analysis using SQL? 09


[Link] Swipe to
Analytics for All
continue reading

Final Cleaned Data

After performing these steps, your cleaned SalesTransactions

table would look something like this:

TransactionID CustomerID TransactionDate ProductID QuantitySold TotalAmount

1001 2001 12-09-2024 P101 5 500

1002 2001 12-09-2024 P102 10 1500

1003 2002 11-09-2024 P103 3 450

Summary

Handled missing values by filling or removing rows with NULL

values

Removed duplicates to ensure each transaction is unique

Standardized date formats for consistent querying

Handled outliers in QuantitySold and TotalAmount to ensure

realistic data

Validated ProductIDs by removing invalid entries.

By applying these SQL techniques, you ensure the dataset is clean,

accurate, and ready for analysis or reporting.

How To Prepare Messy Data for Analysis using SQL? 10


TM

Start Your Data


Analytics Journey
Today!

Learn. Practice. Get Mentored.


with [Link]

If you're interested in learning


data analytics in 90 Days.

DM “Course” to
join the course.

You might also like