Sravya Madipalli
Master Data Cleaning in SQL
Data cleaning is a critical step in any data
analysis or data science project. Without proper
data cleaning, your analysis may lead to
inaccurate or misleading results.
Today we will look into:
- Essential SQL data cleaning techniques
- Practical examples to demonstrate each concept
- Step-by-step strategies to help you clean and prepare your data effectively
At the end, you'll also find common interview
questions to test your knowledge and readiness
for SQL-focused roles.
Handling Missing Values
Missing values can lead to inaccurate analysis or cause
errors during joins and aggregations. SQL provides
several ways to deal with missing or null values.
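To make the join problem concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. The users and orders tables and their columns are hypothetical, chosen only for illustration. It shows that a NULL key never matches in a join, and that email = NULL finds nothing while IS NULL does.

```python
import sqlite3

# Hypothetical tables, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (user_id INTEGER, email TEXT);
    CREATE TABLE orders (order_id INTEGER, user_id INTEGER);
    INSERT INTO users  VALUES (1, 'a@example.com'), (2, NULL);
    INSERT INTO orders VALUES (10, 1), (11, NULL);
""")

# NULL never equals anything (not even NULL), so order 11 finds no match.
joined = conn.execute(
    "SELECT o.order_id FROM orders o JOIN users u ON o.user_id = u.user_id"
).fetchall()

# IS NULL finds missing values; "= NULL" never evaluates to true.
missing = conn.execute(
    "SELECT user_id FROM users WHERE email IS NULL"
).fetchall()
wrong = conn.execute(
    "SELECT user_id FROM users WHERE email = NULL"
).fetchall()

print(joined, missing, wrong)
```

Order 11 silently drops out of the join, which is the kind of quiet row loss that makes missing values worth handling explicitly.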
Solution
Use COALESCE() or IFNULL() to replace missing values
with defaults.
Code Example:
SELECT COALESCE(email, 'unknown') AS cleaned_email
FROM users;
Explanation:
- COALESCE() returns the first non-null value from a list of arguments.
- This query replaces any NULL values in the email column with 'unknown', so downstream queries see a consistent placeholder instead of NULL.
Removing Duplicates
Duplicates in data can distort results and lead to
incorrect conclusions. SQL offers multiple ways to
identify and remove duplicates.
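Before removing anything, it usually helps to see which keys are duplicated. The following is a minimal sketch using Python's sqlite3 module. The sample rows are made up, and treating user_id as the duplicate key is an assumption for illustration.

```python
import sqlite3

# Made-up sample data; user_id is assumed to be the duplicate key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, user_id INTEGER, created_at TEXT);
    INSERT INTO orders VALUES
        (1, 100, '2024-01-01'),
        (2, 100, '2024-02-01'),
        (3, 200, '2024-01-15');
""")

# GROUP BY + HAVING lists every key that appears more than once.
dupes = conn.execute("""
    SELECT user_id, COUNT(*) AS occurrences
    FROM orders
    GROUP BY user_id
    HAVING COUNT(*) > 1
""").fetchall()

print(dupes)
```

A check like this gives a count of affected keys, which is a useful sanity check before and after deduplication.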
Use DISTINCT or ROW_NUMBER() to eliminate
duplicate rows.
WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY
        created_at DESC) AS row_num
    FROM orders
)
SELECT * FROM cte
WHERE row_num = 1;
- ROW_NUMBER() assigns a sequential number to each row within a partition defined by user_id, starting at 1 for the most recent created_at.
- This query keeps only the most recent row for each user_id, filtering older duplicates out of the result.
- Useful when tracking unique users or transactions.
Standardizing Data Formats
Inconsistent data formats, especially with text, can cause
issues when performing comparisons or analysis.
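As a small illustration of the comparison problem, the sketch below uses Python's sqlite3 module with a hypothetical customers table holding three spellings of the same name. It compares matches and distinct counts on the raw values against the same checks after LOWER() and TRIM().

```python
import sqlite3

# Hypothetical table with one name entered three different ways.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (first_name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?)",
    [("Alice",), ("alice",), ("ALICE ",)],
)

# Raw comparison is exact, so only one row matches.
exact_matches = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE first_name = 'alice'"
).fetchone()[0]

# After trimming and lowercasing, all three rows match.
cleaned_matches = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE LOWER(TRIM(first_name)) = 'alice'"
).fetchone()[0]

distinct_raw, distinct_cleaned = conn.execute("""
    SELECT COUNT(DISTINCT first_name),
           COUNT(DISTINCT LOWER(TRIM(first_name)))
    FROM customers
""").fetchone()

print(exact_matches, cleaned_matches, distinct_raw, distinct_cleaned)
```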
Solution: Use UPPER(), LOWER(), and TRIM() to standardize
text formats.
Code Example:
SELECT LOWER(first_name) AS standardized_name
FROM customers;
Explanation:
- LOWER() converts text to lowercase, ensuring consistent formatting.
- This is especially useful when comparisons are case-sensitive, since it avoids mismatches due to inconsistent capitalization.
Handling Outliers
Outliers can distort the results of your analysis, affecting
averages, totals, and other metrics. Proper handling of
outliers is crucial.
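To see how far a single extreme value can pull an average, consider this minimal sketch with Python's sqlite3 module. The five amounts are invented for illustration, and the row with order_id 5 is assumed to be the outlier.

```python
import sqlite3

# Invented amounts; order 5 is assumed to be the extreme value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10), (2, 12), (3, 11), (4, 9), (5, 1000)],
)

# Average over every row, including the extreme value.
avg_all = conn.execute("SELECT AVG(amount) FROM orders").fetchone()[0]

# Average over the typical rows only.
avg_typical = conn.execute(
    "SELECT AVG(amount) FROM orders WHERE order_id <> 5"
).fetchone()[0]

print(avg_all, avg_typical)
```

One row moves the average from about 10 to over 200, which is why outliers deserve a deliberate decision rather than being left in by default.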
Use statistical measures like AVG() and STDDEV() to
detect outliers.
SELECT order_id, amount
FROM orders
WHERE amount > (SELECT AVG(amount) + 3 *
STDDEV(amount) FROM orders);
- This query identifies any amount values that are more than three standard deviations above the average.
- Such extreme values often represent outliers that can skew your analysis.
Once detected, outliers can be either removed or capped.
DELETE FROM orders
WHERE amount > (SELECT AVG(amount) + 3 *
STDDEV(amount) FROM orders);
UPDATE orders
SET amount = (SELECT AVG(amount) + 3 * STDDEV(amount) FROM orders)
WHERE amount > (SELECT AVG(amount) + 3 * STDDEV(amount) FROM orders);
- The DELETE statement removes rows where amount exceeds the outlier threshold.
- The UPDATE statement limits amount to the threshold value, reducing the effect of outliers while preserving the row.
Dates-Related Data Cleaning
Dates are critical for time-based analysis.
Standardizing date formats and extracting specific
components are common tasks.
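The sketch below illustrates why mixed date formats are a problem, using Python's sqlite3 and datetime modules. It is only an illustration under stated assumptions: the two input formats and the to_iso helper are hypothetical, Python's strptime stands in for a SQL conversion function, and SQLite's strftime() stands in for component extraction.

```python
import sqlite3
from datetime import datetime

# Hypothetical input formats; extend the tuple for your own data.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

def to_iso(text):
    """Return text as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

raw_dates = ["2024-01-15", "15/03/2024", "2023-12-01"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?)", [(d,) for d in raw_dates])

# Mixed formats compare as plain text, so MAX() misses 15 March 2024.
latest_raw = conn.execute("SELECT MAX(order_date) FROM orders").fetchone()[0]

# Reload the same dates after converting them to one format.
conn.execute("DELETE FROM orders")
conn.executemany(
    "INSERT INTO orders VALUES (?)", [(to_iso(d),) for d in raw_dates]
)
latest_clean = conn.execute("SELECT MAX(order_date) FROM orders").fetchone()[0]

# strftime('%m', ...) extracts the month component in SQLite.
latest_month = conn.execute(
    "SELECT strftime('%m', MAX(order_date)) FROM orders"
).fetchone()[0]

print(latest_raw, latest_clean, latest_month)
```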
Standardizing Date Formats
Ensure all dates follow a consistent format using
functions like TO_DATE().
SELECT TO_DATE(order_date, 'YYYY-MM-DD') AS standardized_date
FROM orders;
Explanation:
- TO_DATE() parses a text value according to the given format mask and returns a DATE, so every row is stored and compared as a true date rather than as a string.
Extracting Year, Month, or Day from Dates
Sometimes you need to break down a date into its
components for specific analyses, like grouping by
year or month.
SELECT EXTRACT(YEAR FROM order_date) AS order_year,
       EXTRACT(MONTH FROM order_date) AS order_month
FROM orders;
Explanation:
- EXTRACT() pulls out individual components like year or month from a date field.
- This is useful for time-based aggregations and identifying trends over specific periods.
Correcting Data Entry Errors
Manual data entry often leads to errors in format,
especially in fields like phone numbers or emails. These
errors can cause issues downstream in your analysis.
Solution: Use REGEXP to detect and correct formatting
errors.
Code Example:
SELECT phone_number
FROM customers
WHERE phone_number NOT REGEXP '^[0-9]{10}$';
Explanation:
- REGEXP is a pattern-matching operator (available in MySQL, for example) that tests a value against a regular expression.
- This query finds phone numbers that don't match the 10-digit numeric format.
- By detecting such inconsistencies early, you can avoid analysis errors later on.
Handling Null Values in Aggregations
Null values in aggregations can cause unexpected results, because SUM(), AVG(), and COUNT(column) skip them.
For example, AVG() divides by the number of non-null rows only.
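The effect is easy to check with a minimal sketch in Python's sqlite3 module. The three sample rows are invented, with one NULL amount, and the query compares the aggregates with and without COALESCE().

```python
import sqlite3

# Invented sample rows; order 2 has a missing amount.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (2, None), (3, 20.0)],
)

# COUNT(*) counts rows; COUNT(amount), SUM and AVG skip the NULL.
stats = conn.execute("""
    SELECT COUNT(*),
           COUNT(amount),
           SUM(amount),
           AVG(amount),
           AVG(COALESCE(amount, 0))
    FROM orders
""").fetchone()

# When every input is NULL, SUM returns NULL unless COALESCE is applied.
all_null = conn.execute("""
    SELECT SUM(amount), SUM(COALESCE(amount, 0))
    FROM orders
    WHERE order_id = 2
""").fetchone()

print(stats, all_null)
```

Whether the NULL row should count toward the average is a business decision. The point of the sketch is only that the two choices give different numbers.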
Solution: Use COALESCE() or modify aggregation
functions to handle nulls.
Code Example:
SELECT SUM(COALESCE(amount, 0)) AS total_amount
FROM orders;
Explanation:
- COALESCE() replaces NULL values with 0 before summing the amount.
- Because SUM() already skips NULLs, the practical effect is that the total comes back as 0 rather than NULL when every amount is NULL.
Removing Leading and Trailing Spaces
Extra spaces can cause comparison issues and lead to
inconsistent results, especially in text fields.
Solution: Use TRIM() to remove unnecessary whitespace.
Code Example:
SELECT TRIM(first_name) AS trimmed_name
FROM customers;
Explanation:
- TRIM() removes leading and trailing spaces, ensuring consistent and clean data for comparisons and joins.
Data Cleaning Interview Questions
1. How would you handle missing values in a dataset?
   a. Discuss techniques such as using COALESCE() or replacing nulls with averages or other default values.
2. What is the difference between removing and capping outliers, and when would you use each?
   a. Explain how removing outliers completely eliminates them, while capping reduces their impact without removing the data point.
3. Can you explain how you would standardize date formats in SQL?
   a. Walk through the process of using TO_DATE() or similar functions to ensure consistency in date formats.
4. How can you remove duplicates from a dataset?
   a. Discuss using DISTINCT or ROW_NUMBER() to identify and eliminate duplicates.
5. What methods can you use to identify and correct data entry errors, like incorrectly formatted phone numbers?
   a. Explain how REGEXP can be used to identify patterns and detect inconsistencies.
6. Why is it important to handle null values in aggregations, and how would you do it in SQL?
   a. Mention COALESCE() or handling nulls directly in aggregate functions like SUM() or COUNT().
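One way to rehearse several of these answers together is a small end-to-end sketch. The version below uses Python's sqlite3 module and rests on a few assumptions: the customers table, its columns, and the sample rows are hypothetical, the window function needs SQLite 3.25 or newer, and because SQLite ships no REGEXP implementation, a small regexp function is registered from Python.

```python
import re
import sqlite3

def regexp(pattern, value):
    """Backs SQLite's REGEXP operator: 'X REGEXP Y' calls regexp(Y, X)."""
    return int(value is not None and re.search(pattern, value) is not None)

conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)

# Hypothetical table and rows, for illustration only.
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER, first_name TEXT,
        phone_number TEXT, created_at TEXT
    )
""")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, " Alice ", "555-123-4567", "2024-01-01"),
        (1, "ALICE", "5551234567", "2024-02-01"),
        (2, "bob", None, "2024-01-10"),
    ],
)

rows = conn.execute("""
    WITH cleaned AS (
        -- Standardize text and replace missing phone numbers.
        SELECT customer_id,
               LOWER(TRIM(first_name)) AS first_name,
               COALESCE(phone_number, 'unknown') AS phone_number,
               created_at
        FROM customers
    ),
    ranked AS (
        -- Number each customer's rows, newest first.
        SELECT *, ROW_NUMBER() OVER (
                   PARTITION BY customer_id ORDER BY created_at DESC
               ) AS row_num
        FROM cleaned
    )
    -- Keep the newest row and flag phone numbers that are not 10 digits.
    SELECT customer_id, first_name, phone_number,
           phone_number REGEXP '^[0-9]{10}$' AS phone_ok
    FROM ranked
    WHERE row_num = 1
    ORDER BY customer_id
""").fetchall()

print(rows)
```

Flagging invalid phone numbers instead of deleting them keeps the row available for later correction, which mirrors the remove-versus-cap trade-off discussed for outliers.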