SQL for Data Analysis — Complete Guide
A compact, practical handbook covering SQL fundamentals, analysis patterns, joins, aggregation, window
functions, performance tips, and real-world examples. Designed for beginners and junior analysts.
Table of Contents
1. Quick overview
2. Example schema (sample tables)
3. Basics — SELECT, WHERE, LIMIT
4. Aggregation & GROUP BY
5. JOINs — combining tables
6. Subqueries and CTEs (common table expressions)
7. Window functions — analysis within partitions
8. CASE, conditional aggregation
9. Useful functions (dates, strings, casts)
10. Data cleaning & transformations
11. Performance tips & best practices
12. Common pitfalls & how to avoid them
13. Quick cheat-sheet (syntax highlight)
14. Practical examples — analytics recipes
15. Testing & validation
16. Exercises & answers
1. Quick overview
SQL (Structured Query Language) is the lingua franca for working with relational data. As a data analyst
you’ll use SQL to:
• extract and filter data,
• aggregate and summarize,
• join tables and prepare features,
• run window calculations (running totals, ranks),
• prepare clean datasets for BI tools or ML pipelines.
Use this guide as a hands-on reference: copy the examples into your SQL client and adapt to your schema.
1
2. Example schema (sample tables)
-- customers
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
name VARCHAR(100),
city VARCHAR(50),
signup_date DATE
);
-- orders
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_date DATE,
amount DECIMAL(10,2),
product_category VARCHAR(50)
);
-- sales (another common layout)
CREATE TABLE sales (
sale_id INT PRIMARY KEY,
customer_id INT,
sale_date DATE,
product VARCHAR(100),
revenue DECIMAL(12,2),
quantity INT
);
3. Basics — SELECT, WHERE, LIMIT
-- select specific columns
SELECT name, city FROM customers;
-- filtering
SELECT * FROM orders WHERE amount > 100.00 AND order_date >= '2024-01-01';
-- distinct values
SELECT DISTINCT product_category FROM orders;
-- limit rows (useful in big tables)
SELECT * FROM sales ORDER BY sale_date DESC LIMIT 10;
2
4. Aggregation & GROUP BY
-- total revenue by product
SELECT product, SUM(revenue) AS total_revenue
FROM sales
GROUP BY product
ORDER BY total_revenue DESC;
-- count of orders per customer
SELECT customer_id, COUNT(*) AS orders
FROM orders
GROUP BY customer_id
HAVING COUNT(*) >= 3; -- filter groups
Tip: Every non-aggregated column in SELECT must be in GROUP BY (standard SQL).
5. JOINs — combining tables
-- inner join: customers with their orders (only if orders exist)
SELECT c.customer_id, c.name, o.order_id, o.amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;
-- left join: keep all customers, orders if present
SELECT c.customer_id, c.name, o.order_id, o.amount
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id;
-- full outer join: keep rows from both sides
SELECT c.customer_id, c.name, o.order_id
FROM customers c
FULL OUTER JOIN orders o ON c.customer_id = o.customer_id;
Use LEFT JOIN when you need all rows from left table even if no match.
6. Subqueries and CTEs (common table expressions)
-- correlated subquery: latest order date per customer
SELECT c.customer_id, c.name,
(SELECT MAX(order_date) FROM orders o WHERE o.customer_id = c.customer_id) AS
last_order
3
FROM customers c;
-- CTE for readability
WITH customer_totals AS (
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
)
SELECT ct.customer_id, c.name, ct.total_spent
FROM customer_totals ct
JOIN customers c ON c.customer_id = ct.customer_id
ORDER BY ct.total_spent DESC;
CTEs make complex queries modular and readable.
7. Window functions — analysis within partitions
Window functions compute values over partitions without collapsing rows.
-- running total of revenue by customer (ordered by date)
SELECT sale_id, customer_id, sale_date, revenue,
SUM(revenue) OVER (PARTITION BY customer_id ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS
running_total
FROM sales;
-- rank customers by total revenue
WITH cust_rev AS (
SELECT customer_id, SUM(revenue) AS total_revenue
FROM sales
GROUP BY customer_id
)
SELECT customer_id, total_revenue,
RANK() OVER (ORDER BY total_revenue DESC) AS revenue_rank,
NTILE(4) OVER (ORDER BY total_revenue DESC) AS revenue_quartile
FROM cust_rev;
-- lag / lead to compare with previous row
SELECT sale_id, customer_id, sale_date, revenue,
LAG(revenue,1) OVER (PARTITION BY customer_id ORDER BY sale_date) AS
prev_revenue,
revenue - LAG(revenue,1) OVER (PARTITION BY customer_id ORDER BY sale_date)
AS diff_from_prev
FROM sales;
4
Common functions: ROW_NUMBER() , RANK() , DENSE_RANK() , NTILE(n) , LAG() , LEAD() ,
SUM() / AVG() OVER (...) .
8. CASE, conditional aggregation
-- categorize orders into high/medium/low
SELECT order_id, amount,
CASE
WHEN amount >= 1000 THEN 'High'
WHEN amount >= 500 THEN 'Medium'
ELSE 'Low'
END AS amount_band
FROM orders;
-- conditional counts
SELECT product_category,
SUM(CASE WHEN amount > 100 THEN 1 ELSE 0 END) AS high_value_orders,
COUNT(*) AS total_orders
FROM orders
GROUP BY product_category;
9. Useful functions (dates, strings, casts)
-- date truncation to month (Postgres syntax)
SELECT date_trunc('month', order_date) AS month, SUM(amount) FROM orders GROUP
BY 1 ORDER BY 1;
-- or standard: extract year/month
SELECT EXTRACT(YEAR FROM order_date) AS yr, EXTRACT(MONTH FROM order_date) AS
mth, SUM(amount)
FROM orders GROUP BY 1,2 ORDER BY 1,2;
-- string functions
SELECT UPPER(name), TRIM(name), SUBSTRING(name FROM 1 FOR 10) FROM customers;
-- casting
SELECT CAST(amount AS DECIMAL(10,2)) FROM orders; -- standard
5
10. Data cleaning & transformations
• Remove duplicates (Postgres example using ROW_NUMBER):
WITH duplicates AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY unique_col1, unique_col2 ORDER BY
(SELECT 0)) AS rn
FROM table_name
)
DELETE FROM table_name WHERE ctid IN (
SELECT ctid FROM duplicates WHERE rn > 1
);
• Replace NULLs:
SELECT COALESCE(email, '[email protected]') FROM users;
• Pivot (wide) via conditional aggregation:
SELECT product,
SUM(CASE WHEN EXTRACT(MONTH FROM sale_date)=1 THEN revenue ELSE 0 END)
AS Jan,
SUM(CASE WHEN EXTRACT(MONTH FROM sale_date)=2 THEN revenue ELSE 0 END)
AS Feb
FROM sales
GROUP BY product;
11. Performance tips & best practices
1. Avoid SELECT * — fetch only needed columns.
2. Index filter and join columns (e.g., CREATE INDEX idx_orders_customer ON
orders(customer_id); ).
3. Use EXPLAIN / EXPLAIN ANALYZE to see query plan.
4. Avoid functions on indexed columns in WHERE (use range queries instead).
5. Use LIMIT when exploring large tables.
6. Test CTE vs temp tables for performance differences in your DB.
7. Batch large deletes/updates, or use partitions to drop data fast.
8. Use appropriate datatypes — smaller types use less I/O.
9. Denormalize for analytics: star schema (facts + dimension tables) speeds up BI queries.
10. Materialized views for expensive aggregations (refresh periodically).
6
12. Common pitfalls & how to avoid them
• Wrong GROUP BY: include all non-aggregated columns.
• Misinterpreting NULLs: NULL is not equal to anything — use IS NULL .
• Implicit type conversion causing full table scans.
• Not using transactions for multi-step updates.
• Relying on placement guarantees from bootcamps without checking alumni data.
13. Quick cheat-sheet (syntax highlight)
Select & filter
SELECT col1, col2 FROM table WHERE col1 = 'x' AND col2 BETWEEN 10 AND 20 ORDER
BY col1 DESC LIMIT 100;
Aggregation
SELECT col, SUM(val) FROM table GROUP BY col HAVING SUM(val) > 1000;
Join
SELECT * FROM a INNER JOIN b ON a.id=b.a_id;
SELECT * FROM a LEFT JOIN b ON a.id=b.a_id;
CTE
WITH cte AS (SELECT ... )
SELECT * FROM cte WHERE ...;
Window
SELECT col, SUM(val) OVER (PARTITION BY col2 ORDER BY col3 ROWS BETWEEN
UNBOUNDED PRECEDING AND CURRENT ROW) AS run_sum FROM table;
Pivot (manual)
7
SELECT key,
SUM(CASE WHEN category='A' THEN amount ELSE 0 END) AS A_amount,
SUM(CASE WHEN category='B' THEN amount ELSE 0 END) AS B_amount
FROM table GROUP BY key;
Subquery
SELECT * FROM customers WHERE customer_id IN (SELECT customer_id FROM orders
WHERE amount > 100);
14. Practical examples — analytics recipes
A. Cohort retention (monthly)
WITH first_order AS (
SELECT customer_id, MIN(DATE_TRUNC('month', order_date)) AS cohort_month
FROM orders
GROUP BY customer_id
),
monthly_orders AS (
SELECT customer_id, DATE_TRUNC('month', order_date) AS order_month
FROM orders
)
SELECT f.cohort_month,
mo.order_month,
COUNT(DISTINCT mo.customer_id) AS active_customers
FROM first_order f
JOIN monthly_orders mo ON f.customer_id = mo.customer_id
GROUP BY f.cohort_month, mo.order_month
ORDER BY f.cohort_month, mo.order_month;
B. Top 5 products by revenue last 3 months
SELECT product, SUM(revenue) AS revenue
FROM sales
WHERE sale_date >= CURRENT_DATE - INTERVAL '3 months'
GROUP BY product
ORDER BY revenue DESC
LIMIT 5;
8
C. Churn code: customers with no orders in last 180 days
SELECT c.customer_id, c.name
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name
HAVING MAX(o.order_date) < CURRENT_DATE - INTERVAL '180 days' OR
MAX(o.order_date) IS NULL;
15. Testing & validation
• Use small sample queries to verify: SELECT COUNT(*) FROM table;
• Compare aggregates between raw and transformed tables.
• Spot-check a few records with ORDER BY and LIMIT .
• Write unit tests for SQL transformations if your stack supports them (dbt, Great Expectations).
16. Exercises & answers
Exercise 1 — Top customers by average order value
Task: Return top 10 customers by average order amount (min orders 3).
WITH cust_stats AS (
SELECT customer_id, AVG(amount) AS avg_order, COUNT(*) AS num_orders
FROM orders
GROUP BY customer_id
)
SELECT customer_id, avg_order, num_orders
FROM cust_stats
WHERE num_orders >= 3
ORDER BY avg_order DESC
LIMIT 10;
Exercise 2 — Monthly growth rate of revenue
Task: Calculate month-over-month revenue growth percentages.
WITH monthly AS (
SELECT DATE_TRUNC('month', sale_date) AS month, SUM(revenue) AS revenue
FROM sales
GROUP BY 1
9
)
SELECT month, revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_rev,
CASE WHEN LAG(revenue) OVER (ORDER BY month) IS NULL THEN NULL
ELSE (revenue - LAG(revenue) OVER (ORDER BY month)) /
NULLIF(LAG(revenue) OVER (ORDER BY month),0) * 100
END AS mom_growth_pct
FROM monthly
ORDER BY month;
End of guide.
You can edit, copy, or download this document from the right-side canvas.
10