[Draft v2] Another Multi group by optimization#10976
[Draft v2] Another Multi group by optimization#10976jayzhan211 wants to merge 2 commits intoapache:mainfrom
Conversation
|
Great news! Although this approach is slightly slow on small string, but it is significant better if string is large. THIS PR Main |
alamb
left a comment
There was a problem hiding this comment.
This is very cool @jayzhan211 -- exactly the right idea I think
| SELECT "URLHash", "EventDate"::INT::DATE, COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "TraficSourceID" IN (-1, 6) AND "RefererHash" = 3594120000172545465 GROUP BY "URLHash", "EventDate"::INT::DATE ORDER BY PageViews DESC LIMIT 10 OFFSET 100; | ||
| SELECT "WindowClientWidth", "WindowClientHeight", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "DontCountHits" = 0 AND "URLHash" = 2868770270353813622 GROUP BY "WindowClientWidth", "WindowClientHeight" ORDER BY PageViews DESC LIMIT 10 OFFSET 10000; | ||
| SELECT DATE_TRUNC('minute', to_timestamp_seconds("EventTime")) AS M, COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-14' AND "EventDate"::INT::DATE <= '2013-07-15' AND "IsRefresh" = 0 AND "DontCountHits" = 0 GROUP BY DATE_TRUNC('minute', to_timestamp_seconds("EventTime")) ORDER BY DATE_TRUNC('minute', M) LIMIT 10 OFFSET 1000; | ||
| SELECT "UserID", concat("SearchPhrase", repeat('hello', 100)) as s, COUNT(*) FROM hits GROUP BY "UserID", s LIMIT 10; |
There was a problem hiding this comment.
This is a clever way to quickly iterate for benchmark tests 👍
| ); | ||
|
|
||
|
|
||
| let u64_vec: Vec<u64> = groups.iter().map(|&x| x as u64).collect(); |
There was a problem hiding this comment.
It might be faster to write into u64_vec directly rather than write to groups and then copy over
|
|
||
|
|
||
| // Index Array: [0, 1, 1, 0] | ||
| // Data Array: ['a', 'c'] |
There was a problem hiding this comment.
This code copies all the strings again, which likely takes significant time
One way to make this faster is likey to special case the "has no nulls" path (so we can avoid if let Some(.) check each row
The only other ways I can think to make this faster is to avoid copying the strings all together. The only way to do so I can figure are:
- Return a DictionaryArray as output
- Return a
StringViewArray(similar to DictionayArray) [Epic] Implement support forStringViewin DataFusion #10918
🤔
Maybe we can try to hack in the ability to have the HashAggregateExec return Dictionary(Int32, String) for these multi-column groups and see if we can show significant performance improvements
If so then we can figure out how to thread that ability through the engine.
|
BTW I hope/plan to use a plane trip on Sunday to prototype the "add a physical optimizer pass that sets types to |
|
I am starting to play with this PR again |
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
|
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |

Which issue does this PR close?
Closes #.
Related to #9403
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?