0% found this document useful (0 votes)
60 views11 pages

Presto SQL Optimization

The document provides optimization techniques for Presto SQL, focusing on simplifying SQL queries, improving execution speed, and enhancing JOIN performance. Key strategies include using WITH clauses for readability, avoiding memory-intensive operations like UNION and ORDER BY, and optimizing JOINs by filtering data size and avoiding redundant tables. Additionally, it suggests using regular expressions for LIKE clauses, converting long IN lists into temporary tables, and ensuring efficient join order to prevent memory issues.

Uploaded by

devin.zhangzj
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
60 views11 pages

Presto SQL Optimization

The document provides optimization techniques for Presto SQL, focusing on simplifying SQL queries, improving execution speed, and enhancing JOIN performance. Key strategies include using WITH clauses for readability, avoiding memory-intensive operations like UNION and ORDER BY, and optimizing JOINs by filtering data size and avoiding redundant tables. Additionally, it suggests using regular expressions for LIKE clauses, converting long IN lists into temporary tables, and ensuring efficient join order to prevent memory issues.

Uploaded by

devin.zhangzj
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 11

Presto SQL Optimization

1. Simplify SQL
1.1. WITH statement (Common Table Expression)
Simplifying your SQL is the basic step to making it more readable and structured. The easiest
way is to use WITH clause.

If your query becomes complex or deeply nested, try to extract subqueries using WITH clause.
For example, the following query that has a nested subquery:

SELECT a,b,c
FROM (SELECT a, MAX(b) AS b, MIN(c) AS c FROM tbl GROUP BY a) tbl_alias

can be rewritten as follows:

WITH
tbl_alias AS
(
SELECT a, MAX(b) AS b, MIN(c) AS c FROM tbl GROUP BY a
)

SELECT a,b,c
FROM tbl_alias

You can also enumerate multiple sub-queries in WITH clause, by using a comma:

WITH
tbl1 AS
(
SELECT a, MAX(b) AS b, MIN(c) AS c FROM tbl GROUP BY a
),
tbl2 AS
(
SELECT a, AVG(d) AS d FROM another_tbl GROUP BY a
)

SELECT tbl1.*,tbl2.*
FROM tbl1
JOIN tbl2 ON tbl1.a = tbl2.a
2. Improve the speed of SQL
2.1. Avoid using ‘UNION’ and ‘ORDER BY’
Presto tracks the memory usage of each query. UNION and ORDER BY are some of the
memory-intensive operations. Sometimes the error message like ‘exceeded max(local) memory
xxGB error’ will appear if UNION or ORDER BY are used in queries.

● UNION
UNION of SQL queries performs duplicate elimination and requires a substantial amount of
memory:

SELECT ... FROM A


UNION
SELECT ... FROM B

UNION can be replaced by UNION ALL to avoid the duplicate elimination, which requires less
memory and is faster.

SELECT ... FROM A


UNION All
SELECT ... FROM B

● ORDER BY
ORDER BY is used when sorting the data by some condition(s):

SELECT ...
FROM order_mart__order_item_profile
WHERE ...
ORDER BY grass_date

Presto performs sorting using a single worker node. So the entire data set must fit within the
memory limit of a worker (usually less than 5GB). If you are sorting a small number of rows (eg:
~10,000 rows), using ORDER BY is fine. Please avoid using ORDER BY especially if your SQL
is for building Lumos dashboards because you can sort the data by Lumos dashboard settings.
Adding ORDER BY into your query will only make your dashboard load very slowly.

SELECT ...
FROM order_mart__order_item_profile
WHERE ...
2.2. Avoid using SELECT *
In order to preview the data of a table, the following query is commonly used:

SELECT *
FROM my_table

Presto uses JSON text for materializing query results. If the above table contains 100GB of
data, the coordinator transfers more than 100GB of JSON text to save the query result. So,
even if the query computation is almost finished, outputting JSON results takes long time.

If the table only contains 5 or 6 columns, it should be fine. But for some tables having more than
many columns like item_profile, which has around 100 columns, the best way to do is to select
the columns you would like to look at:

SELECT column A, column B ...


FROM my_table

2.3. Aggregate LIKE clauses into a regular expression

● Use ‘regexp_like’ to replace ‘LIKE’ clauses to filter data

Presto’s query optimizer is unable to improve queries where many LIKE clauses are used. As a
consequence, the query execution can be slower than expected. To improve the performance,
you can substitute a series of LIKE clauses chained in an OR with a single regexp_like clause.
For example:

SELECT ...
FROM
shopee_backend_db__logistic_log_v2_tab
WHERE
description LIKE '%ER298620960MY%'
OR description LIKE '%ER285342794MY%'
OR description LIKE '%ER300287576MY%'
OR description LIKE '%ER299556375MY%'
OR description LIKE '%ER297923933MY%'
...

can be optimized by replacing the LIKE clauses with a single regexp_like function:
SELECT ...
FROM shopee_backend_db__logistic_log_v2_tab
WHERE
regexp_like(description,'ER298620960MY|ER285342794MY|ER300287576MY|ER299556375MY|
...')

● Use ‘regexp_extract’ to replace ‘LIKE’ clauses to extract data

Assume we need to extract the brand information from an item name, then we may use a series
of CASE WHEN and LIKE clauses to match and extract the brand data. The SQL is like:

SELECT
CASE
WHEN concat(' ',lower((from_utf8(item_name))),' ') LIKE '%acer%' then 'Acer'
WHEN concat(' ',lower((from_utf8(item_name))),' ') LIKE '%advance%' then 'Advance'
WHEN concat(' ',lower((from_utf8(item_name))),' ') LIKE '%alfalink%' then 'Alfalink'
WHEN concat(' ',lower((from_utf8(item_name))),' ') LIKE '%asus%' then 'ASUS'
WHEN concat(' ',lower((from_utf8(item_name))),' ') LIKE '%aula%' then 'Aula'
WHEN concat(' ',lower((from_utf8(item_name))),' ') LIKE '%brother%' then 'Brother'
WHEN concat(' ',lower((from_utf8(item_name))),' ') LIKE '%canon%' then 'Canon'
WHEN concat(' ',lower((from_utf8(item_name))),' ') LIKE '%cisco%' then 'Cisco'
WHEN concat(' ',lower((from_utf8(item_name))),' ') LIKE '%cliptec%' then 'Cliptec'
WHEN concat(' ',lower((from_utf8(item_name))),' ') LIKE '%corsair%' then 'Corsair'
WHEN concat(' ',lower((from_utf8(item_name))),' ') LIKE '%d-link%' then 'D-Link'

AS brand
FROM item_profile

can be optimized by using regexp_extract function:

SELECT
regexp_extract(lower(from_utf8(item_name)),'(asus|lenovo|hp|acer|dell|zyrex|...)')
AS brand
FROM item_profile

2.4. Convert long list of IN clause into a temporary table

Assume we have a long list of shopids (3363134, 5189076, …, 4062349), and we want to get
the total GMV of these shops on 2018-05-01. It takes a long time for presto to finish the
following SQL:
SELECT
sum(total_price_usd) AS GMV
FROM
order_mart__order_profile
WHERE
shopid in (3363134, 5189076, …, 4062349)
AND
grass_date = date('2018-05-01')

The above SQL can be improved as:

SELECT
sum(total_price_usd) AS GMV
FROM
order_mart__order_profile as order_info
JOIN
(SELECT
cast(shopid as bigint) AS shopid
FROM
(SELECT
split('3363134, 5189076, …, 4062349', ', ') as bar)
CROSS JOIN
UNNEST(bar) AS t(shopid)
) AS shop_info
ON
order_info.shopid = shop_info.shopid
WHERE
order_info.grass_date = date('2018-05-01')

The main idea of this optimization is to convert the list into a table. This method works well on
long list in Presto.
Split is to split the string by a delimiter, and returns an array. Unnest is to return a result column
table with one row for each element of an array.
3. Improve the JOIN performance
3.1. Avoid joining unused tables
Please make sure that you select data from each joined table. Otherwise, there is no need to
join the table. This may happen when you are changing the SQL but forget to remove unused
tables from your SQL.

3.2. Avoid joining redundant tables


First, let’s clarify the difference between redundant tables and unused tables above.

Unused tables are the tables that you joined in your query but didn’t select any column from. If
you remove the join clause with these tables, your query can also return the results you want.

Redundant tables are the tables that you don’t necessarily need to join to get desired data. One
common example is to join order_mart__order_profile with
shopee_order_details_db__order_details_tab or shopee_order_details_db__order_details_tab_extinfo.
In this case, order_details_tab is a redundant table, because almost all the useful and commonly
used data of order_details_tab is in order_mart__order_profile. You can only choose to use
order_mart__order_profile.

If you want to join table A with table B, make sure the data from table A cannot be replaced in
table B, otherwise, the join clause is redundant.

3.3. Narrow down the data size of joined tables


Order_mart__order_item_profile and item_profile are commonly used tables:

SELECT a.orderid, a.itemid, b.item_name, ...


FROM order_mart__order_item_profile AS a
JOIN item_profile AS b
ON a.itemid = b.itemid
AND a.modelid = b.modelid
WHERE a.grass_date >= date ‘2018-09-01’

From the above SQL, the data after ‘2018-09-01’ is needed. We can narrow down the size of
order_item_profile by filtering the date first before joining with item_profile. For item_profile, if only
three columns are needed: itemid, modelid and item_name, we can also narrow down
item_profile by selecting these three columns first before joining. Improved SQL can be:

WITH a AS
(
SELECT...
FROM order_mart__order_item_profile
WHERE grass_date >= DATE '2018-09-01'
),

b AS
(
SELECT itemid, modelid, item_name FROM item_profile
)

SELECT a.orderid, a.itemid, b.item_name...


FROM a JOIN b
ON a.itemid = b.itemid AND a.modelid = b.modelid

3.4. Use simple equi-joins


The equi-join concatenates tables by comparing join keys using the equal (=) operator. If this
comparison becomes complex, the join processing will slow down.

For example, if you want to join two tables with date string e.g., ‘2015-10-01’, but one of the
tables only has columns for year, month, day values, you may write the following query to
generate date strings:

SELECT a.date,
b.name
FROM left_table a
JOIN right_table b
ON a.date = CAST ( (b.year*10000 + b.month*100 + b.day) AS VARCHAR)

This query will delay the join processing since the join condition involves several expressions.
You can make it faster by pushing down this condition into a subquery to prepare a join key
beforehand:

SELECT a.date, b.name


FROM left_table a
JOIN (
SELECT
CAST((b.year * 10000 + b.month * 100 + b.day) as VARCHAR) date, -- generate join key
name
FROM right_table
)b
ON a.date = b.date -- Simple equi-join
In this example, join keys are a.date and b.datestr columns. Comparing VARCHAR strings is
much faster than comparing VARCHAR and expression result. In future, Presto would be able to
optimize this type of queries, but for now, you need to rewrite the query by hand.

3.5. Avoid joining one table multiple times


The following SQL shows how to get the number of gross order, net order and cancelled order
at the same time:

WITH gross AS
(
SELECT grass_date, ...
COUNT(DISTINCT orderid) AS gross_orders
FROM order_mart__order_profile
GROUP BY 1,...
),
net AS
(
SELECT grass_date, ...
COUNT(DISTINCT orderid) AS net_orders
FROM order_mart__order_profile
WHERE be_status IN (...)
GROUP BY 1,...
),
cancel AS
(
SELECT grass_date, ...
COUNT(DISTINCT orderid) AS cancel_orders
FROM order_mart__order_profile
WHERE be_status IN (...)
GROUP BY 1,...
)

SELECT gross.grass_date, ….,


gross.gross_orders,
net.net_orders,
cancel.cancel_orders
FROM gross
JOIN net ON gross.grass_date = net.grass_date and ….
JOIN cancel ON gross.grass_date = cancel.grass_date and ...

Based on the query above, order_profile is joined twice to get gross orders, net orders and
cancelled orders at the same time. A better way to do the same thing is to use CASE WHEN
statement, here is the improved SQL:
SELECT grass_date,...
COUNT(DISTINCT orderid ) AS gross_orders,
COUNT(DISTINCT CASE WHEN be_status IN (...) THEN orderid ELSE NULL END) AS
net_orders,
COUNT(DISTINCT CASE WHEN be_status IN (...) THEN orderid ELSE NULL END) AS
cancel_orders
FROM order_mart__order_profile
GROUP BY 1,...

3.6. Join from the largest tables to smaller tables


The following type of queries that starts with a small table in a join clause usually causes Presto
to run against its memory limits:

SELECT *
FROM small_table JOIN large_table
ON small_table.id = large_table.id

Presto performs broadcast join by default, which partitions the left-hand side table into several
worker nodes, then sends the entire copy of the right-hand side table to the worker nodes that
have a partition. If the right-hand side table is large and doesn’t fit in memory on a worker node,
it will cause an error. Reordering the table join putting the largest table first will solve the issue:

SELECT *
FROM large_table JOIN small_table
ON large_table.id = small_table.id

This query distributes the left table (large_table), greatly reducing the chance of hitting the
memory limit.

3.7. Use distributed joins


If your query still doesn’t work, try distributed join by adding a magic comment that sets a
session property:

-- set session distributed_join = 'true'


SELECT * FROM large_table JOIN small_table
ON small_table.id = large_table.id

The distributed join algorithms partitions both left and right-hand side tables by using hash
values of the join key(s) as a partitioning key. So it would work even if the right-hand side table
is large. A downside is it increases the number of network data transfers and is usually slower
than the broadcast join.
4. Some tricky column extractions
4.1. shopee_item_db__collection_tab.data
This query extracts the shopid and itemid list from column “data” into rows for a particular
collection_id.

SELECT
collection_id,
result['shopid'] AS shopid,
result['itemid'] AS itemid
FROM
(
SELECT
collection_id,
CAST(json_parse(data) AS ARRAY(MAP(VARCHAR,BIGINT))) AS data_array
FROM
shopee_item_db__collection_tab
WHERE
collection_id = 91267
)
CROSS JOIN UNNEST(data_array) AS t(result)

4.2. shopee_account_db__shop_tab_extinfo.logistics_info
This query gives the logistics channelid and whether this channelid is enabled for a particular
shopid. Filter enabled in (‘1’,’true’) to extract only those enabled logistics channels.

SELECT
parent_key as shopid,
channelid,
channel_value,
cast(element_at(channel_value,'enabled') as varchar) as enabled
FROM shopee_account_db__shop_tab_extinfo
CROSS JOIN UNNEST
(CAST(json_extract(from_utf8(from_base64(regexp_replace(logistics_info, '\s'))),
'$.channels') AS MAP(VARCHAR, MAP(VARCHAR,VARCHAR))))
AS t(channelid,channel_value)
WHERE logistics_info is not NULL
4.3. shopee_item_v2_db__item_v2_tab_extinfo.attributes
This query extracts the brand of an item from the “attributes” column in the item_v2_tab_extinfo.

SELECT
item_attr.itemid,
item_attr.attr_id,
name,
display_name,
attr_value AS brand
FROM

(SELECT
parent_key AS itemid,
attr['attr_id'] AS attr_id,
attr['value'] AS attr_value
FROM shopee_item_v2_db__item_v2_tab_extinfo ext
CROSS JOIN UNNEST
(CAST (json_extract (attributes,'$.values') AS ARRAY (MAP (VARCHAR,VARCHAR)))) AS t (attr)
) AS item_attr

JOIN shopee_item_db__item_attribute_tab attr_meta


ON CAST (item_attr.attr_id AS INT) = CAST (attr_meta.attr_id AS INT)
JOIN shopee_item_db__item_attribute_tab_extinfo ext
ON CAST (attr_meta.attr_id AS INT) = CAST (ext.parent_key AS INT)
WHERE ext.brand_option = 1

You might also like