perf(sql): parallel ORDER BY long_column LIMIT N for high-cardinality GROUP BY by puzpuzpuz · Pull Request #6582 · questdb/questdb

puzpuzpuz · 2025-12-29T15:37:48Z

This patch adds parallel execution for the ORDER BY + LIMIT (top K) phase in high-cardinality parallel GROUP BY queries. When the GROUP BY result is sharded (due to high cardinality), the top K selection now processes each shard in parallel using worker threads, then merges the per-shard results.

Changes:

New GroupByLongTopKJob and GroupByLongTopKTask for parallel top K processing
New message bus queue for distributing top K tasks to workers
AsyncGroupByRecordCursor#parallelLongTopK() orchestrates parallel execution when:
- The ORDER BY function is thread-safe
- The map is sharded
- Result size exceeds cairo.sql.parallel.groupby.topk.threshold
New configuration properties:
- cairo.sql.parallel.groupby.topk.threshold - minimum map size to enable parallel top K (default 5M)
- cairo.sql.parallel.groupby.topk.queue.capacity - task queue capacity (default - same as page frame reduce queue capacity)

Benchmarks

ClickBench run on Ryzen 7900x, 64GB RAM, Ubuntu 24.04.

The queries that benefit from parallel top K and their respective sharded map sizes:

Q12 - 6,019,102
Q14 - 6,474,212
Q15 - 17,630,976
Q16 - 24,070,560
Q18 - 56,384,822
Q30 - 5,730,331
Q31 - 13,172,392
Q32 - 99,997,493
Q33 - 18,342,019
Q34 - 18,342,019
Q35 - 9,762,046

ORDER BY + LIMIT phase (not full query execution) times:

Query	Before (ms)	After (ms)	Speedup	Improvement
Q12	53	8	6.6x	84.9%
Q14	26	11	2.4x	57.7%
Q15	95	11	8.6x	88.4%
Q16	70	23	3.0x	67.1%
Q18	161	47	3.4x	70.8%
Q30	9	5	1.8x	44.4%
Q31	20	12	1.7x	40.0%
Q32	115	90	1.3x	21.7%
Q33	95	15	6.3x	84.2%
Q35	56	5	11.2x	91.1%

Analysis of Q32 speed-up

Memory bandwidth analysis:

For Q32's map entries (GROUP BY WatchID long, ClientIP ipv4):

Key: ~12 bytes (long + ipv4)
Values: COUNT(*) + SUM + AVG internals ≈ 32+ bytes
~48 bytes per entry (with alignment)

Total data: 100M × 48 bytes ≈ 4.8 GB

Metric	Sequential	Parallel
Time	115ms	90ms
Effective bandwidth	~42 GB/s	~53 GB/s

Ryzen 7900X with DDR5 dual-channel has theoretical bandwidth of ~90 GB/s, practical peak ~60-70 GB/s. The sequential scan is already at ~60-70% of peak memory bandwidth. The parallel version pushes it closer to ~75-80%. We're hitting the memory bandwidth ceiling.

This explains the modest 1.3x improvement - it's not a CPU-bound workload where 23 workers would help proportionally. The single-threaded scan with hardware prefetching already saturates most of the available DRAM bandwidth. Additional threads only marginally improve memory-level parallelism. The queries with better speedups (Q35 at 11.2x) have smaller working sets that benefit more from cache effects and CPU parallelism rather than being memory-bound.

Queries

-- Q12
SELECT SearchPhrase, COUNT(*) AS c FROM hits WHERE SearchPhrase IS NOT NULL GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;

-- Q14
SELECT SearchEngineID, SearchPhrase, COUNT(*) AS c FROM hits WHERE SearchPhrase IS NOT NULL GROUP BY SearchEngineID, SearchPhrase ORDER BY c DESC LIMIT 10;

-- Q15
SELECT UserID, COUNT(*) AS c FROM hits GROUP BY UserID ORDER BY c DESC LIMIT 10;

-- Q16
SELECT UserID, SearchPhrase, COUNT(*) AS c FROM hits GROUP BY UserID, SearchPhrase ORDER BY c DESC LIMIT 10;

-- Q18
SELECT UserID, extract(minute FROM EventTime) AS m, SearchPhrase, COUNT(*) AS c FROM hits GROUP BY UserID, m, SearchPhrase ORDER BY c DESC LIMIT 10;

-- Q30
SELECT SearchEngineID, ClientIP, COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth) FROM hits WHERE SearchPhrase IS NOT NULL GROUP BY SearchEngineID, ClientIP ORDER BY c DESC LIMIT 10;

-- Q31
SELECT WatchID, ClientIP, COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth) FROM hits WHERE SearchPhrase IS NOT NULL GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10;

-- Q32
SELECT WatchID, ClientIP, COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth) FROM hits GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10;

-- Q33
SELECT URL, COUNT(*) AS c FROM hits ORDER BY c DESC LIMIT 10;

-- Q35
SELECT ClientIP, ClientIP - 1, ClientIP - 2, ClientIP - 3, COUNT(*) AS c FROM hits GROUP BY ClientIP, ClientIP - 1, ClientIP - 2, ClientIP - 3 ORDER BY c DESC LIMIT 10;

… GROUP BY

coderabbitai · 2025-12-29T15:37:58Z

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…l_long_topk

glasstiger · 2025-12-31T13:10:52Z

[PR Coverage check]

😞 fail : 49 / 209 (23.44%)

file detail

	path	covered line	new line	coverage
🔵	io/questdb/std/DirectLongLongSortedList.java	0	3	00.00%
🔵	io/questdb/cairo/map/OrderedMapFixedSizeCursor.java	0	1	00.00%
🔵	io/questdb/griffin/engine/groupby/GroupByLongTopKJob.java	3	30	10.00%
🔵	io/questdb/tasks/GroupByLongTopKTask.java	4	30	13.33%
🔵	io/questdb/std/DirectLongLongAscList.java	1	7	14.29%
🔵	io/questdb/std/DirectLongLongDescList.java	1	7	14.29%
🔵	io/questdb/griffin/engine/table/AsyncGroupByRecordCursor.java	17	91	18.68%
🔵	io/questdb/griffin/engine/table/AsyncGroupByAtom.java	6	21	28.57%
🔵	io/questdb/cairo/DefaultCairoConfiguration.java	1	2	50.00%
🔵	io/questdb/MessageBusImpl.java	6	7	85.71%
🔵	io/questdb/PropertyKey.java	2	2	100.00%
🔵	io/questdb/cairo/CairoConfigurationWrapper.java	2	2	100.00%
🔵	io/questdb/PropServerConfiguration.java	5	5	100.00%
🔵	io/questdb/mp/WorkerPoolUtils.java	1	1	100.00%

perf(sql): parallel ORDER BY long_column LIMIT N for high-cardinality…

1190391

… GROUP BY

puzpuzpuz self-assigned this Dec 29, 2025

puzpuzpuz added SQL Issues or changes relating to SQL execution Performance Performance improvements labels Dec 29, 2025

puzpuzpuz added 6 commits December 30, 2025 21:31

Remove todo and add meaningful comment

312a9eb

Merge remote-tracking branch 'upstream/master' into puzpuzpuz_paralle…

4731aba

…l_long_topk

Add a few tests

f8151e0

Change defaults

e1046af

Proper tests

66f78a9

Simplify code

bfe413e

puzpuzpuz marked this pull request as ready for review December 31, 2025 12:45

bluestreak01 approved these changes Dec 31, 2025

View reviewed changes

bluestreak01 merged commit 2dbe4ca into master Dec 31, 2025
42 of 43 checks passed

bluestreak01 deleted the puzpuzpuz_parallel_long_topk branch December 31, 2025 22:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(sql): parallel ORDER BY long_column LIMIT N for high-cardinality GROUP BY#6582

perf(sql): parallel ORDER BY long_column LIMIT N for high-cardinality GROUP BY#6582
bluestreak01 merged 7 commits intomasterfrom
puzpuzpuz_parallel_long_topk

puzpuzpuz commented Dec 29, 2025 •

edited

Loading

Uh oh!

coderabbitai bot commented Dec 29, 2025 •

edited

Loading

Review skipped

Uh oh!

glasstiger commented Dec 31, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

puzpuzpuz commented Dec 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmarks

Analysis of Q32 speed-up

Queries

Uh oh!

coderabbitai bot commented Dec 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

glasstiger commented Dec 31, 2025

[PR Coverage check]

file detail

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

puzpuzpuz commented Dec 29, 2025 •

edited

Loading

coderabbitai bot commented Dec 29, 2025 •

edited

Loading