Skip to content

perf(sql): parallel ORDER BY long_column LIMIT N for high-cardinality GROUP BY#6582

Merged
bluestreak01 merged 7 commits intomasterfrom
puzpuzpuz_parallel_long_topk
Dec 31, 2025
Merged

perf(sql): parallel ORDER BY long_column LIMIT N for high-cardinality GROUP BY#6582
bluestreak01 merged 7 commits intomasterfrom
puzpuzpuz_parallel_long_topk

Conversation

@puzpuzpuz
Copy link
Copy Markdown
Contributor

@puzpuzpuz puzpuzpuz commented Dec 29, 2025

This patch adds parallel execution for the ORDER BY + LIMIT (top K) phase in high-cardinality parallel GROUP BY queries. When the GROUP BY result is sharded (due to high cardinality), the top K selection now processes each shard in parallel using worker threads, then merges the per-shard results.

Changes:

  • New GroupByLongTopKJob and GroupByLongTopKTask for parallel top K processing
  • New message bus queue for distributing top K tasks to workers
  • AsyncGroupByRecordCursor#parallelLongTopK() orchestrates parallel execution when:
    • The ORDER BY function is thread-safe
    • The map is sharded
    • Result size exceeds cairo.sql.parallel.groupby.topk.threshold
  • New configuration properties:
    • cairo.sql.parallel.groupby.topk.threshold - minimum map size to enable parallel top K (default 5M)
    • cairo.sql.parallel.groupby.topk.queue.capacity - task queue capacity (default - same as page frame reduce queue capacity)

Benchmarks

ClickBench run on Ryzen 7900x, 64GB RAM, Ubuntu 24.04.

The queries that benefit from parallel top K and their respective sharded map sizes:

  • Q12 - 6,019,102
  • Q14 - 6,474,212
  • Q15 - 17,630,976
  • Q16 - 24,070,560
  • Q18 - 56,384,822
  • Q30 - 5,730,331
  • Q31 - 13,172,392
  • Q32 - 99,997,493
  • Q33 - 18,342,019
  • Q34 - 18,342,019
  • Q35 - 9,762,046

ORDER BY + LIMIT phase (not full query execution) times:

Query Before (ms) After (ms) Speedup Improvement
Q12 53 8 6.6x 84.9%
Q14 26 11 2.4x 57.7%
Q15 95 11 8.6x 88.4%
Q16 70 23 3.0x 67.1%
Q18 161 47 3.4x 70.8%
Q30 9 5 1.8x 44.4%
Q31 20 12 1.7x 40.0%
Q32 115 90 1.3x 21.7%
Q33 95 15 6.3x 84.2%
Q35 56 5 11.2x 91.1%
Analysis of Q32 speed-up

Memory bandwidth analysis:

For Q32's map entries (GROUP BY WatchID long, ClientIP ipv4):

  • Key: ~12 bytes (long + ipv4)
  • Values: COUNT(*) + SUM + AVG internals ≈ 32+ bytes
  • ~48 bytes per entry (with alignment)

Total data: 100M × 48 bytes ≈ 4.8 GB

Metric Sequential Parallel
Time 115ms 90ms
Effective bandwidth ~42 GB/s ~53 GB/s

Ryzen 7900X with DDR5 dual-channel has theoretical bandwidth of ~90 GB/s, practical peak ~60-70 GB/s. The sequential scan is already at ~60-70% of peak memory bandwidth. The parallel version pushes it closer to ~75-80%. We're hitting the memory bandwidth ceiling.

This explains the modest 1.3x improvement - it's not a CPU-bound workload where 23 workers would help proportionally. The single-threaded scan with hardware prefetching already saturates most of the available DRAM bandwidth. Additional threads only marginally improve memory-level parallelism. The queries with better speedups (Q35 at 11.2x) have smaller working sets that benefit more from cache effects and CPU parallelism rather than being memory-bound.

Queries
-- Q12
SELECT SearchPhrase, COUNT(*) AS c FROM hits WHERE SearchPhrase IS NOT NULL GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;

-- Q14
SELECT SearchEngineID, SearchPhrase, COUNT(*) AS c FROM hits WHERE SearchPhrase IS NOT NULL GROUP BY SearchEngineID, SearchPhrase ORDER BY c DESC LIMIT 10;

-- Q15
SELECT UserID, COUNT(*) AS c FROM hits GROUP BY UserID ORDER BY c DESC LIMIT 10;

-- Q16
SELECT UserID, SearchPhrase, COUNT(*) AS c FROM hits GROUP BY UserID, SearchPhrase ORDER BY c DESC LIMIT 10;

-- Q18
SELECT UserID, extract(minute FROM EventTime) AS m, SearchPhrase, COUNT(*) AS c FROM hits GROUP BY UserID, m, SearchPhrase ORDER BY c DESC LIMIT 10;

-- Q30
SELECT SearchEngineID, ClientIP, COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth) FROM hits WHERE SearchPhrase IS NOT NULL GROUP BY SearchEngineID, ClientIP ORDER BY c DESC LIMIT 10;

-- Q31
SELECT WatchID, ClientIP, COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth) FROM hits WHERE SearchPhrase IS NOT NULL GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10;

-- Q32
SELECT WatchID, ClientIP, COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth) FROM hits GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10;

-- Q33
SELECT URL, COUNT(*) AS c FROM hits ORDER BY c DESC LIMIT 10;

-- Q35
SELECT ClientIP, ClientIP - 1, ClientIP - 2, ClientIP - 3, COUNT(*) AS c FROM hits GROUP BY ClientIP, ClientIP - 1, ClientIP - 2, ClientIP - 3 ORDER BY c DESC LIMIT 10;

@puzpuzpuz puzpuzpuz self-assigned this Dec 29, 2025
@puzpuzpuz puzpuzpuz added SQL Issues or changes relating to SQL execution Performance Performance improvements labels Dec 29, 2025
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Dec 29, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@puzpuzpuz puzpuzpuz marked this pull request as ready for review December 31, 2025 12:45
@glasstiger
Copy link
Copy Markdown
Contributor

[PR Coverage check]

😞 fail : 49 / 209 (23.44%)

file detail

path covered line new line coverage
🔵 io/questdb/std/DirectLongLongSortedList.java 0 3 00.00%
🔵 io/questdb/cairo/map/OrderedMapFixedSizeCursor.java 0 1 00.00%
🔵 io/questdb/griffin/engine/groupby/GroupByLongTopKJob.java 3 30 10.00%
🔵 io/questdb/tasks/GroupByLongTopKTask.java 4 30 13.33%
🔵 io/questdb/std/DirectLongLongAscList.java 1 7 14.29%
🔵 io/questdb/std/DirectLongLongDescList.java 1 7 14.29%
🔵 io/questdb/griffin/engine/table/AsyncGroupByRecordCursor.java 17 91 18.68%
🔵 io/questdb/griffin/engine/table/AsyncGroupByAtom.java 6 21 28.57%
🔵 io/questdb/cairo/DefaultCairoConfiguration.java 1 2 50.00%
🔵 io/questdb/MessageBusImpl.java 6 7 85.71%
🔵 io/questdb/PropertyKey.java 2 2 100.00%
🔵 io/questdb/cairo/CairoConfigurationWrapper.java 2 2 100.00%
🔵 io/questdb/PropServerConfiguration.java 5 5 100.00%
🔵 io/questdb/mp/WorkerPoolUtils.java 1 1 100.00%

@bluestreak01 bluestreak01 merged commit 2dbe4ca into master Dec 31, 2025
42 of 43 checks passed
@bluestreak01 bluestreak01 deleted the puzpuzpuz_parallel_long_topk branch December 31, 2025 22:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Performance Performance improvements SQL Issues or changes relating to SQL execution

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants