Advanced SQL Case Study
Advanced SQL Case Study
CASE STUDIES
DATA ANALYTICS | SQL CASE STUDY
01
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
02
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
03
TRY FOLLOWING QUESTIONS
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
04
1. What is the total amount each customer spent at
the restaurant?
SELECT s.customer_id, SUM(price) AS total_sales
FROM dbo.sales AS s
JOIN dbo.menu AS m
ON s.product_id = m.product_id
GROUP BY customer_id;
FROM dbo.sales AS s
JOIN dbo.menu AS m
ON s.product_id = m.product_id
)
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
05
SELECT customer_id, product_name
FROM ordered_sales_cte
WHERE rank = 1
GROUP BY customer_id, product_name;
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
06
SELECT customer_id, product_name, order_count
FROM fav_item_cte
WHERE rank = 1;
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
07
7. Which item was purchased just before the customer
became a member?
WITH prior_member_purchased_cte AS
(
SELECT s.customer_id, m.join_date, s.order_date, s.product_id,
DENSE_RANK() OVER(PARTITION BY s.customer_id
ORDER BY s.order_date DESC) AS rank
FROM sales AS s
JOIN members AS m
ON s.customer_id = m.customer_id
WHERE s.order_date < m.join_date
)
SELECT s.customer_id, s.order_date, m2.product_name
FROM prior_member_purchased_cte AS s
JOIN menu AS m2
ON s.product_id = m2.product_id
WHERE rank = 1;
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
08
8. What is the total items and amount spent for each
member before they became a member?
SELECT s.customer_id, COUNT(DISTINCT s.product_id) AS
unique_menu_item, SUM(mm.price) AS total_sales
FROM sales AS s
JOIN members AS m
ON s.customer_id = m.customer_id
JOIN menu AS mm
ON s.product_id = mm.product_id
WHERE s.order_date < m.join_date
GROUP BY s.customer_id;
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
09
SELECT s.customer_id, SUM(p.points) AS total_points
FROM price_points_cte AS p
JOIN sales AS s
ON p.product_id = s.product_id
GROUP BY s.customer_id
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
10
FROM dates_cte AS d
JOIN sales AS s
ON d.customer_id = s.customer_id
JOIN menu AS m
ON s.product_id = m.product_id
WHERE s.order_date < d.last_date
GROUP BY d.customer_id, s.order_date, d.join_date,
d.valid_date, d.last_date,
m.product_name, m.price
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
11
12) Rank All The Things
WITH summary_cte AS
(
SELECT s.customer_id, s.order_date, m.product_name,
m.price,
CASE
WHEN mm.join_date > s.order_date THEN 'N'
WHEN mm.join_date <= s.order_date THEN 'Y'
ELSE 'N' END AS member
FROM sales AS s
LEFT JOIN menu AS m
ON s.product_id = m.product_id
LEFT JOIN members AS mm
ON s.customer_id = mm.customer_id
)SELECT *, CASE
WHEN member = 'N' then NULL
ELSE
RANK () OVER(PARTITION BY customer_id, member
ORDER BY order_date) END AS ranking
FROM summary_cte;
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
12
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
13
Table : runner_orders
After each orders are received through the system - they are
assigned to a runner - however not all orders are fully
completed and can be cancelled by the restaurant or the
customer.
The pickup_time is the timestamp at which the runner arrives
at the Pizza Runner headquarters to pick up the freshly
cooked pizzas. The distance and duration fields are related to
how far and long the runner had to travel to deliver the order
to the respective customer.
There are some known data issues with this table so be
careful when using this in your queries - make sure to check
the data types for each column in the schema SQL!
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
14
Table : pizza_names
At the moment - Pizza Runner only has 2 pizzas available the
Meat Lovers or Vegetarian!
Table: pizza_recipes
Each pizza_id has a standard set of toppings which are used
as part of the pizza recipe.
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
15
Table: pizza_toppings
This table contains all of the topping_name values with their
corresponding topping_id value
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
16
TRY FOLLOWING QUESTIONS
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
17
TRY FOLLOWING QUESTIONS
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
18
1)Data Cleaning and Transformation
Firstly, to clean up exclusions and extras in the customer_orders
— we create TEMP TABLE #customer_orders and use CASE
WHEN.
In summary,
·pickup_time — Remove nulls and replace with ‘ ‘
·distance — Remove ‘km’ and nulls
·duration — Remove ‘minutes’ and nulls
·cancellation — Remove NULL and null and replace with ‘ ‘
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
19
·SELECT order_id, runner_id,
CASE
WHEN pickup_time LIKE 'null' THEN ' '
ELSE pickup_time
END AS pickup_time,
CASE
WHEN distance LIKE 'null' THEN ' '
WHEN distance LIKE '%km' THEN TRIM('km' from distance)
ELSE distance END AS distance,
CASE
WHEN duration LIKE 'null' THEN ' '
WHEN duration LIKE '%mins' THEN TRIM('mins' from duration)
WHEN duration LIKE '%minute' THEN TRIM('minute' from
duration)
WHEN duration LIKE '%minutes' THEN TRIM('minutes' from
duration)
ELSE duration END AS duration,
CASE
WHEN cancellation IS NULL or cancellation LIKE 'null' THEN ''
ELSE cancellation END AS cancellation
INTO #runner_orders
FROM runner_orders;
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
20
ALTER TABLE #runner_orders
ALTER COLUMN pickup_time DATETIME,
ALTER COLUMN distance FLOAT,
ALTER COLUMN duration INT;
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
21
ON c.pizza_id = p.pizza_id
WHERE r.distance != 0
GROUP BY p.pizza_name;
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
22
8) For each customer, how many delivered pizzas had at
least 1 change and how many had no changes?
SELECT c.customer_id,
SUM(CASE
WHEN c.exclusions <> ' ' OR c.extras <> ' ' THEN 1
ELSE 0
END) AS at_least_1_change,
SUM(CASE
WHEN c.exclusions = ' ' AND c.extras = ' ' THEN 1
ELSE 0
END) AS no_change
FROM #customer_orders AS c
JOIN #runner_orders AS r
ON c.order_id = r.order_id
WHERE r.distance != 0
GROUP BY c.customer_id
ORDER BY c.customer_id;
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
JOIN #runner_orders AS r
23
ON c.order_id = r.order_id
WHERE r.distance >= 1
AND exclusions <> ' '
AND extras <> ' ';
10) What was the total volume of pizzas ordered for each
hour of the day?
SELECT DATEPART(HOUR, [order_time]) AS hour_of_day,
COUNT(order_id) AS pizza_count
FROM #customer_orders
GROUP BY DATEPART(HOUR, [order_time]);
11) What was the volume of orders for each day of the
week?
SELECT FORMAT(DATEADD(DAY, 2, order_time),'dddd') AS
day_of_week,
-- add 2 to adjust 1st day of the week as Monday
COUNT(order_id) AS total_pizzas_ordered
FROM #customer_orders
GROUP BY FORMAT(DATEADD(DAY, 2, order_time),'dddd');
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
24
13) What was the average time in minutes it took for each
runner to arrive at the Pizza Runner HQ to pickup the
order?
WITH time_taken_cte AS
(
SELECT c.order_id, c.order_time, r.pickup_time,
DATEDIFF(MINUTE, c.order_time, r.pickup_time) AS
pickup_minutes
FROM #customer_orders AS c
JOIN #runner_orders AS r
ON c.order_id = r.order_id
WHERE r.distance != 0
GROUP BY c.order_id, c.order_time, r.pickup_time
)
SELECT AVG(pickup_minutes) AS avg_pickup_minutes
FROM time_taken_cte
WHERE pickup_minutes > 1;
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
25
JOIN #runner_orders AS r
ON c.order_id = r.order_id
WHERE r.distance != 0
GROUP BY c.order_id, c.order_time, r.pickup_time
)
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
26
Then, we find the difference by deducting the shortest (MIN)
from the longest (MAX) delivery times.
SELECT MAX(duration::NUMERIC) - MIN(duration::NUMERIC) AS
delivery_time_difference
FROM #runner_orders
WHERE duration not like '% %'
17) What was the average speed for each runner for each
delivery and do you notice any trend for these values?
SELECT r.runner_id, c.customer_id, c.order_id,
COUNT(c.order_id) AS pizza_count,
r.distance, (r.duration / 60) AS duration_hr ,
ROUND((r.distance/r.duration * 60), 2) AS avg_speed
FROM #runner_orders AS r
JOIN #customer_orders AS c
ON r.order_id = c.order_id
WHERE distance != 0
GROUP BY r.runner_id, c.customer_id, c.order_id, r.distance,
r.duration
ORDER BY c.order_id;
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
27
18) What is the successful delivery percentage for each
runner?
SELECT runner_id,
ROUND(100 * SUM
(CASE WHEN distance = 0 THEN 0
ELSE 1
END) / COUNT(*), 0) AS success_perc
FROM #runner_orders
GROUP BY runner_id;
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
28
Events
Customer visits are logged in this events table at a cookie_id
level and the event_typeand page_id values can be used to
join onto relevant satellite tables to obtain further
information about each event. The sequence_number is used
to order the events within each visit.
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
29
Event Identifier
The event_identifier table shows the types of events which are
captured by Clique Bait’s digital data systems.
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
30
Campaign Identifier
This table shows information for the 3 campaigns that Clique
Bait has ran on their website so far in 2020.
Page Hierarchy
This table lists all of the pages on the Clique Bait website which
are tagged and have data passing through from user
interaction events.
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
31
TRY FOLLOWING QUESTIONS
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
32
1. How many users are there?
SELECT COUNT(DISTINCT user_id) AS user_count
FROM clique_bait.users;
SELECT
ROUND(AVG(cookie_id_count),0) AS avg_cookie_id
FROM cookie;
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
33
3. What is the unique number of visits by all users per
month?
First, extract numerical month from event_time so that we
can group the data by month.
Unique is a keyword to use DISTINCT.
SELECT
EXTRACT(MONTH FROM event_time) as month,
COUNT(DISTINCT visit_id) AS unique_visit_count
FROM clique_bait.events
GROUP BY EXTRACT(MONTH FROM event_time);
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
34
5. What is the percentage of visits which have a purchase
event?
Join the events and events_identifier table and filter by
Purchase event only.
As the data is now filtered to having Purchase events only,
counting the distinct visit IDs would give you the number of
purchase events.
Then, divide the number of purchase events with a subquery
of total number of distinct visits from the events table.
SELECT
100 * COUNT(DISTINCT e.visit_id)/
(SELECT COUNT(DISTINCT visit_id) FROM clique_bait.events) AS
percentage_purchase
FROM clique_bait.events AS e
JOIN clique_bait.event_identifier AS ei
ON e.event_type = ei.event_type
WHERE ei.event_name = 'Purchase';
6. What is the percentage of visits which view the
checkout page but do not have a purchase event? The
strategy to answer this question is to breakdown the question
into 2 parts.
Part 1: Create a CTE and using CASE statements, find the MAX()
of:
event_type = 1 (Page View) and page_id = 12 (Checkout), and
assign "1" to these events. These events are when user
viewed the checkout page.
event_type = 3 (Purchase) and assign "1" to these events.
These events signifies users who made a purchase.
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
35
We're using MAX() because we do not want to group the results
by event_type and page_id. Since the max score is "1", it would
mean "Give me the max score for each event".
Part 2: Using the table we have created, find the percentage of
visits which view checkout page.
WITH checkout_purchase AS (
SELECT
visit_id,
MAX(CASE WHEN event_type = 1 AND page_id = 12 THEN 1 ELSE
0 END) AS checkout,
MAX(CASE WHEN event_type = 3 THEN 1 ELSE 0 END) AS
purchase
FROM clique_bait.events
GROUP BY visit_id)
SELECT
ROUND(100 * (1-(SUM(purchase)::numeric/SUM(checkout))),2)
AS percentage_checkout_view_with_no_purchase
FROM checkout_purchase
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
36
GROUP BY ph.page_name
ORDER BY page_views DESC -- Order by descending to retrieve
highest to lowest number of views
LIMIT 3; -- Limit results to 3 to find the top 3
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
37
9)Using a single SQL query - create a new output table
which has the following details:
1. How many times was each product viewed?
2. How many times was each product added to cart?
3. How many times was each product added to a cart but not
purchased (abandoned)?
4. How many times was each product purchased?
Planning Our Strategy
Let us visualize the output table.
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
38
Note 3 - In combined_table CTE, merge product_page_events
and purchase_events using LEFT JOIN. Take note of the table
sequence. In order to filter for visit ids with purchases, we
use a CASE statement and where visit id is not null, it means
the visit id is a purchase.
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
39
),
combined_table AS ( -- Note 3
SELECT
ppe.visit_id,
ppe.product_id,
ppe.product_name,
ppe.product_category,
ppe.page_view,
ppe.cart_add,
CASE WHEN pe.visit_id IS NOT NULL THEN 1 ELSE 0 END AS
purchase
FROM product_page_events AS ppe
LEFT JOIN purchase_events AS pe
ON ppe.visit_id = pe.visit_id
),
product_info AS (
SELECT
product_name,
product_category,
SUM(page_view) AS views,
SUM(cart_add) AS cart_adds,
SUM(CASE WHEN cart_add = 1 AND purchase = 0 THEN 1 ELSE
0 END) AS abandoned,
zepanalytics.com
DATA ANALYTICS | SQL CASE STUDY
40
SUM(CASE WHEN cart_add = 1 AND purchase = 1 THEN 1 ELSE
0 END) AS purchases
FROM combined_table
GROUP BY product_id, product_name, product_category)
SELECT *
FROM product_info
ORDER BY product_id;
zepanalytics.com
Ready to take the next steps?
So these were some of the important DAX
functions which you should remember.
Explore
zepanalytics.com