[Draft v2] Another Multi group by optimization by jayzhan211 · Pull Request #10976 · apache/datafusion

jayzhan211 · 2024-06-18T07:04:23Z

Which issue does this PR close?

Closes #.

Related to #9403

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

jayzhan211 · 2024-06-18T07:15:27Z

jayzhan211 · 2024-06-18T07:38:13Z

Great news! Although this approach is slightly slower on small strings, it is significantly better when the strings are large.

THIS PR

Q0: SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
Query 0 iteration 0 took 2873.6 ms and returned 10 rows
Query 0 iteration 1 took 2595.7 ms and returned 10 rows
Query 0 iteration 2 took 2320.4 ms and returned 10 rows
Query 0 iteration 3 took 2269.0 ms and returned 10 rows
Query 0 iteration 4 took 2321.2 ms and returned 10 rows
Q1: SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" LIMIT 10;
Query 1 iteration 0 took 2280.9 ms and returned 10 rows
Query 1 iteration 1 took 2679.1 ms and returned 10 rows
Query 1 iteration 2 took 2683.2 ms and returned 10 rows
Query 1 iteration 3 took 2534.8 ms and returned 10 rows
Query 1 iteration 4 took 2804.4 ms and returned 10 rows
Q2: SELECT "UserID", concat("SearchPhrase", repeat('hello', 100)) as s, COUNT(*) FROM hits GROUP BY "UserID", s LIMIT 10;
Query 2 iteration 0 took 28927.7 ms and returned 10 rows
Query 2 iteration 1 took 27570.2 ms and returned 10 rows
Query 2 iteration 2 took 28776.2 ms and returned 10 rows
Query 2 iteration 3 took 32869.2 ms and returned 10 rows
Query 2 iteration 4 took 35559.9 ms and returned 10 rows
Done

Main

Q0: SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
Query 0 iteration 0 took 2010.2 ms and returned 10 rows
Query 0 iteration 1 took 1896.5 ms and returned 10 rows
Query 0 iteration 2 took 1707.0 ms and returned 10 rows
Query 0 iteration 3 took 1776.9 ms and returned 10 rows
Query 0 iteration 4 took 1685.8 ms and returned 10 rows
Q1: SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" LIMIT 10;
Query 1 iteration 0 took 1695.3 ms and returned 10 rows
Query 1 iteration 1 took 1727.3 ms and returned 10 rows
Query 1 iteration 2 took 1704.5 ms and returned 10 rows
Query 1 iteration 3 took 1832.4 ms and returned 10 rows
Query 1 iteration 4 took 1709.8 ms and returned 10 rows
Q2: SELECT "UserID", concat("SearchPhrase", repeat('hello', 100)) as s, COUNT(*) FROM hits GROUP BY "UserID", s LIMIT 10;
Query 2 iteration 0 took 57890.4 ms and returned 10 rows
Query 2 iteration 1 took 60489.2 ms and returned 10 rows
Query 2 iteration 2 took 59249.8 ms and returned 10 rows
Query 2 iteration 3 took 57315.8 ms and returned 10 rows
Query 2 iteration 4 took 53942.7 ms and returned 10 rows
Done

alamb

This is very cool @jayzhan211 -- exactly the right idea I think

alamb · 2024-06-18T11:17:32Z

benchmarks/queries/clickbench/queries.sql

-SELECT "URLHash", "EventDate"::INT::DATE, COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "TraficSourceID" IN (-1, 6) AND "RefererHash" = 3594120000172545465 GROUP BY "URLHash", "EventDate"::INT::DATE ORDER BY PageViews DESC LIMIT 10 OFFSET 100;
-SELECT "WindowClientWidth", "WindowClientHeight", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "DontCountHits" = 0 AND "URLHash" = 2868770270353813622 GROUP BY "WindowClientWidth", "WindowClientHeight" ORDER BY PageViews DESC LIMIT 10 OFFSET 10000;
-SELECT DATE_TRUNC('minute', to_timestamp_seconds("EventTime")) AS M, COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-14' AND "EventDate"::INT::DATE <= '2013-07-15' AND "IsRefresh" = 0 AND "DontCountHits" = 0 GROUP BY DATE_TRUNC('minute', to_timestamp_seconds("EventTime")) ORDER BY DATE_TRUNC('minute', M) LIMIT 10 OFFSET 1000;
+SELECT "UserID", concat("SearchPhrase", repeat('hello', 100)) as s, COUNT(*) FROM hits GROUP BY "UserID", s LIMIT 10;


This is a clever way to quickly iterate for benchmark tests 👍

alamb · 2024-06-18T11:30:06Z

datafusion/physical-plan/src/aggregates/group_values/row.rs

+        );
+
+
+        let u64_vec: Vec<u64> = groups.iter().map(|&x| x as u64).collect();


It might be faster to write into u64_vec directly rather than write to groups and then copy over
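A minimal sketch of that suggestion, using only the standard library. The `HashMap` lookup and both function names are hypothetical stand-ins for the real group-lookup code in row.rs, which is not shown in this thread; only the `groups` to `u64_vec` copy mirrors the line under review.

```rust
use std::collections::HashMap;

/// Two-step shape, as in the reviewed line: collect `usize` group indices
/// first, then copy them into a separate `Vec<u64>`.
/// The HashMap is a hypothetical stand-in for the real group lookup.
fn group_ids_via_copy(keys: &[&str]) -> Vec<u64> {
    let mut seen: HashMap<&str, usize> = HashMap::new();
    let mut groups: Vec<usize> = Vec::with_capacity(keys.len());
    for &key in keys {
        let next = seen.len();
        groups.push(*seen.entry(key).or_insert(next));
    }
    // Second pass over all rows plus a second allocation.
    groups.iter().map(|&x| x as u64).collect()
}

/// Single-pass shape: push each group index as `u64` directly,
/// so no intermediate `groups` vector is built.
fn group_ids_direct(keys: &[&str]) -> Vec<u64> {
    let mut seen: HashMap<&str, u64> = HashMap::new();
    let mut u64_vec: Vec<u64> = Vec::with_capacity(keys.len());
    for &key in keys {
        let next = seen.len() as u64;
        u64_vec.push(*seen.entry(key).or_insert(next));
    }
    u64_vec
}
```

Both functions return the same indices; the direct version only saves the extra pass and allocation, so whether it is measurably faster would still need a benchmark.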

alamb · 2024-06-18T11:36:29Z

datafusion/physical-plan/src/aggregates/group_values/row.rs

+
+
+                // Index Array: [0, 1, 1, 0]
+                // Data Array: ['a', 'c']


This code copies all the strings again, which likely takes significant time

One way to make this faster is likely to special-case the "has no nulls" path (so we can avoid the if let Some(..) check on each row).
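As an illustration of the "has no nulls" special case, here is a standard-library-only sketch. `StringColumn` and `copy_strings` are hypothetical names, not Arrow or DataFusion types; the assumption is that the absence of a validity mask is known before the loop starts.

```rust
/// Hypothetical stand-in for a string column: values plus an optional
/// validity mask. `validity == None` means the column has no nulls.
struct StringColumn<'a> {
    values: Vec<&'a str>,
    validity: Option<Vec<bool>>,
}

/// Copies all strings into one contiguous buffer and records end offsets.
/// Null rows contribute zero bytes.
fn copy_strings(col: &StringColumn<'_>) -> (String, Vec<usize>) {
    let mut data = String::new();
    let mut offsets = Vec::with_capacity(col.values.len() + 1);
    offsets.push(0);
    match &col.validity {
        // Fast path: decided once per column, so the loop has no per-row null check.
        None => {
            for v in &col.values {
                data.push_str(v);
                offsets.push(data.len());
            }
        }
        // General path: every row is checked against the validity mask.
        Some(valid) => {
            for (v, &is_valid) in col.values.iter().zip(valid) {
                if is_valid {
                    data.push_str(v);
                }
                offsets.push(data.len());
            }
        }
    }
    (data, offsets)
}
```

The strings are still copied in both branches; the fast path only removes the per-row branch.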

The only other way I can think of to make this faster is to avoid copying the strings altogether. The only ways I can figure to do so are:

Return a DictionaryArray as output

Return a StringViewArray (similar to DictionaryArray); see [Epic] Implement support for StringView in DataFusion #10918

🤔

Maybe we can try to hack in the ability to have the HashAggregateExec return Dictionary(Int32, String) for these multi-column groups and see if we can show significant performance improvements

If so then we can figure out how to thread that ability through the engine.
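To make the dictionary idea concrete, the sketch below encodes rows into the index/data layout from the code comment under review (Index Array [0, 1, 1, 0], Data Array ['a', 'c']). It uses plain standard-library types rather than Arrow's DictionaryArray, and `dictionary_encode` is a hypothetical helper name.

```rust
use std::collections::HashMap;

/// Encodes rows as (index array, data array): each distinct string is stored
/// once in `data`, and every row holds an i32 key pointing into it.
/// The values are borrowed, so no string bytes are copied.
fn dictionary_encode<'a>(rows: &[&'a str]) -> (Vec<i32>, Vec<&'a str>) {
    let mut lookup: HashMap<&'a str, i32> = HashMap::new();
    let mut data: Vec<&'a str> = Vec::new();
    let mut index: Vec<i32> = Vec::with_capacity(rows.len());
    for &row in rows {
        let key = *lookup.entry(row).or_insert_with(|| {
            // First time this value is seen: append it and use its position as the key.
            data.push(row);
            (data.len() - 1) as i32
        });
        index.push(key);
    }
    (index, data)
}
```

In this layout the output cost for repeated values is one small integer per row instead of one string copy per row, which is the effect the Dictionary(Int32, String) experiment would try to measure.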

alamb · 2024-06-21T22:28:31Z

BTW I hope/plan to use a plane trip on Sunday to prototype the "add a physical optimizer pass that sets types to StringView" approach

alamb · 2024-06-28T20:17:01Z

I am starting to play with this PR again


github-actions · 2024-08-28T01:54:37Z

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jun 18, 2024

This was referenced Jun 18, 2024

Improve performance for grouping by variable length columns (strings) #9403

Closed

[DRAFT 2 times slower] Multiple group by optimization #10937

Closed

alamb reviewed Jun 18, 2024


alamb mentioned this pull request Jun 18, 2024

Minor: reuse Rows buffer in GroupValuesRows #10980

Merged

jayzhan211 mentioned this pull request Jun 19, 2024

[Epic] Implement support for StringView in DataFusion #10918

Closed

16 tasks

alamb closed this Jun 28, 2024

alamb reopened this Jun 28, 2024

alamb and others added 2 commits June 28, 2024 16:19

another group by

58a1fc8

Signed-off-by: jayzhan211 <[email protected]>

update query

024d829

Signed-off-by: jayzhan211 <[email protected]>

alamb force-pushed the multi-group-v3 branch from e46acbc to 024d829 Compare June 28, 2024 20:19

alamb mentioned this pull request Jul 15, 2024

2024 Q3-Q4 Roadmap? #11442

Closed

This was referenced Jul 27, 2024

Better multi-column aggregation support with StringView #11684

Closed

Better multi-column aggregation support with StringView #11794

Closed

github-actions bot added the Stale PR has not had any activity for some time label Aug 28, 2024

github-actions bot closed this Sep 4, 2024

jayzhan211 wants to merge 2 commits into apache:main from jayzhan211:multi-group-v3


2 participants
