Skip to content

Parallel execution for ORDER BY + LIMIT N in GROUP BY queries with large dataset #4085

@puzpuzpuz

Description

@puzpuzpuz

Is your feature request related to a problem?

#4032 added parallel execution for more parallel GROUP BY cases.

Specifically, this ClickBench query involves a large dataset (>90M rows), so we shard the underlying FastMap into smaller maps using radix partitioning:

SELECT WatchID, ClientIP, COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth) FROM hits GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10;

This means that LimitedSizeSortedLightRecordCursor which does the ORDER BY c DESC LIMIT 10 part operates on a ShardedMapCursor cursor which wraps the shard maps. This top K takes around 2 seconds since it runs on a single thread. We should detect the "potentially sharded result set" case, then run top K in parallel, and then do a K-way merge on top of the gathered results. This should speed up the top K part of the query execution significantly.

Describe the solution you'd like.

No response

Describe alternatives you've considered.

No response

Full Name:

Andrei Pechkurov

Affiliation:

QuestDB

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    PerformancePerformance improvementsSQLIssues or changes relating to SQL execution

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions