perf(sql): parallel group by with optional filtering #4032
Merged
…o puzpuzpuz_parallel_group_by # Conflicts: # core/src/main/java/io/questdb/cairo/map/FastMap.java
ideoma previously approved these changes on Dec 21, 2023
ideoma reviewed on Dec 21, 2023
ideoma reviewed on Dec 21, 2023
ideoma approved these changes on Dec 22, 2023
[PR Coverage check] 😍 pass: 2012 / 2272 (88.56%)
Currently, GROUP BY queries run in parallel only in a few cases:

- `select sum(value) from t`.
- `select int_key, sum(value) from t`.

This patch extends the cases in which GROUP BY is executed by multiple threads. The implementation builds on top of parallel SQL filters (a.k.a. async offload), so the same scheduling and cancellation behavior applies. The work is split into page frame tasks, aggregated by the shared workers, and accumulated in `FastMap` (keyed GROUP BY) or `SimpleMapValue` (non-keyed GROUP BY). `FastMap`/`SimpleMapValue` is reused between different query executions (and different queries).

As an optional second step in the query processing, we merge sharded maps in parallel. Shards contain non-intersecting sets of groups, so once the shards are fully merged, we return their rows to the caller. This behavior kicks in only for large enough maps (`cairo.sql.parallel.groupby.sharding.threshold`, defaults to 10k).

The implementation also "steals" the filter from the underlying factory, so both of the following sample queries will be executed by the new parallel GROUP BY framework:
Currently supported aggregate functions: `count(*)` and `count(col)`, `avg`, `sum`, `min`/`max`, `vwap` (all for fixed-size types).

Benchmark results aren't included, but the improvement on my 4c/8t machine varies from 2x to 10x depending on the query.
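To make the two-step scheme described above more concrete, here is a minimal sketch in plain Java. It is not the code from this patch: `ShardedGroupBySketch`, `SHARD_COUNT`, `groupBySum` and the use of `HashMap` and an `ExecutorService` are all assumptions made for illustration, standing in for page frame tasks, shared workers and `FastMap`. It computes `sum(value)` per key by aggregating each "frame" into sharded partial maps and then merging each shard in its own task.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: a hypothetical stand-in for the parallel GROUP BY flow.
final class ShardedGroupBySketch {
    static final int SHARD_COUNT = 4; // hypothetical shard count

    // Equivalent of "select key, sum(value) from t" over two in-memory columns.
    static Map<Integer, Long> groupBySum(int[] keys, long[] values, int frameSize, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // Step 1: one task per frame, each aggregating into its own sharded maps.
            List<Future<List<Map<Integer, Long>>>> frameTasks = new ArrayList<>();
            for (int lo = 0; lo < keys.length; lo += frameSize) {
                final int from = lo;
                final int to = Math.min(lo + frameSize, keys.length);
                frameTasks.add(pool.submit(() -> aggregateFrame(keys, values, from, to)));
            }
            List<List<Map<Integer, Long>>> partials = new ArrayList<>();
            for (Future<List<Map<Integer, Long>>> f : frameTasks) {
                partials.add(f.get());
            }

            // Step 2: merge every shard in parallel, one task per shard.
            List<Future<Map<Integer, Long>>> mergeTasks = new ArrayList<>();
            for (int s = 0; s < SHARD_COUNT; s++) {
                final int shard = s;
                mergeTasks.add(pool.submit(() -> mergeShard(partials, shard)));
            }

            // Shards hold disjoint key sets, so the results can simply be concatenated.
            Map<Integer, Long> result = new HashMap<>();
            for (Future<Map<Integer, Long>> f : mergeTasks) {
                result.putAll(f.get());
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    // Aggregates rows [from, to) into SHARD_COUNT partial maps chosen by key hash.
    private static List<Map<Integer, Long>> aggregateFrame(int[] keys, long[] values, int from, int to) {
        List<Map<Integer, Long>> shards = new ArrayList<>();
        for (int i = 0; i < SHARD_COUNT; i++) {
            shards.add(new HashMap<>());
        }
        for (int i = from; i < to; i++) {
            int shard = Math.floorMod(Integer.hashCode(keys[i]), SHARD_COUNT);
            shards.get(shard).merge(keys[i], values[i], Long::sum);
        }
        return shards;
    }

    // Combines the given shard of every partial result into a single map.
    private static Map<Integer, Long> mergeShard(List<List<Map<Integer, Long>>> partials, int shard) {
        Map<Integer, Long> merged = new HashMap<>();
        for (List<Map<Integer, Long>> partial : partials) {
            partial.get(shard).forEach((k, v) -> merged.merge(k, v, Long::sum));
        }
        return merged;
    }
}
```

Because a key always hashes to the same shard, no two merge tasks touch the same group, which is why the merge step needs no locking in this sketch. The sketch always shards, whereas the patch does so only above the configured threshold.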
The new behavior is enabled by default, but can be switched off with `cairo.sql.parallel.groupby.enabled=false`.

Also includes #4078 (single `count_distinct` re-write to a parallel GROUP BY for all supported types except symbol).

Other limitations

- `count_distinct`, aren't yet supported.

Next steps

- `CompactMap` and introduce `UnorderedMap` for the small fixed-size key-value case. That's to speed up key look-ups by avoiding extra access to `FastMap`'s heap once we've determined the hash table slot.
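The idea behind the last item can be sketched as follows. This is only one reading of the description, not the planned `UnorderedMap`: the class `InlineLongLongMap`, its methods and the hash mixer are hypothetical. With fixed-size keys and values, both can live directly in the hash table slot, so a look-up that has found the slot needs no further jump into a separate heap.

```java
// Illustrative only: open addressing with the key and value stored inline in the slot.
final class InlineLongLongMap {
    private long[] slots;   // layout: [key0, value0, key1, value1, ...]
    private boolean[] used; // used[i] marks slot i as occupied
    private int mask;
    private int size;

    InlineLongLongMap(int initialCapacity) {
        // Round the capacity up to a power of two so that "& mask" replaces modulo.
        allocate(Integer.highestOneBit(Math.max(2, initialCapacity) - 1) << 1);
    }

    // Adds delta to the value stored under key, inserting the key with 0 if absent.
    void addTo(long key, long delta) {
        if (size * 2 >= used.length) {
            rehash();
        }
        int i = probe(key);
        if (!used[i]) {
            used[i] = true;
            slots[2 * i] = key;
            slots[2 * i + 1] = 0;
            size++;
        }
        slots[2 * i + 1] += delta; // the value sits next to the key, no second look-up
    }

    long get(long key, long defaultValue) {
        int i = probe(key);
        return used[i] ? slots[2 * i + 1] : defaultValue;
    }

    int size() {
        return size;
    }

    // Linear probing: returns the slot holding key, or the first free slot on its probe path.
    private int probe(long key) {
        int i = (int) mix(key) & mask;
        while (used[i] && slots[2 * i] != key) {
            i = (i + 1) & mask;
        }
        return i;
    }

    private void rehash() {
        long[] oldSlots = slots;
        boolean[] oldUsed = used;
        allocate(oldUsed.length * 2);
        for (int i = 0; i < oldUsed.length; i++) {
            if (oldUsed[i]) {
                int j = probe(oldSlots[2 * i]);
                used[j] = true;
                slots[2 * j] = oldSlots[2 * i];
                slots[2 * j + 1] = oldSlots[2 * i + 1];
            }
        }
    }

    private void allocate(int capacity) {
        slots = new long[2 * capacity];
        used = new boolean[capacity];
        mask = capacity - 1;
    }

    // Simple multiplicative mixer, chosen for the sketch only.
    private static long mix(long k) {
        k *= 0x9E3779B97F4A7C15L;
        return k ^ (k >>> 32);
    }
}
```

The trade-off in this layout is that it only works when keys and values have a known fixed width, which matches the "small fixed-size key-value case" named above.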