-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Description
Problem: Head-of-Line Blocking in PageFrameSequence
Background
PageFrameSequence is the central coordinator for parallel query execution in QuestDB. It dispatches page frame tasks to worker threads via a ring queue and collects results as workers complete them.
The current implementation uses an ordered collection model: tasks are dispatched to sequential ring queue slots, and the collectSubSeq (Single Consumer Sequence) advances through slots in FIFO order. A task at slot N can only be collected after slot N-1 has been collected.
The Problem
This design causes head-of-line blocking when workers have variable completion times.
Scenario:
- Frames 0, 1, 2, 3, 4 are dispatched to queue slots 100, 101, 102, 103, 104
- Worker A processes frame 0 (slot 100), Workers B-E process frames 1-4
- Worker A is slow (e.g., frame 0 has more rows, complex filter, or cold storage access)
- Workers B-E complete frames 1-4 quickly
Result:
collectSubSeqis blocked at slot 100 waiting for Worker A- Completed frames 1-4 sit in the queue, unable to be collected
- The queue fills up, blocking dispatch of new frames
- Work stealing can process other queued tasks but cannot help with the slow task already owned by Worker A
- Pipeline throughput degrades to single-worker speed until the slow task completes
Downsides
- Reduced parallelism: Completed work waits idle while a slow task blocks the pipeline
- Queue pressure: Full queue forces fallback to local execution (
workLocally()), losing parallelism for remaining frames - Latency spikes: A single slow partition/frame can significantly increase total query time
- Wasted resources: Worker threads that finished early sit idle waiting for collection to unblock
- Unnecessary complexity: Factories that don't need ordered results (GROUP BY, top K, etc.) pay the cost of ordered collection infrastructure
Observation
The mergeShards() method in AsyncGroupByRecordCursor already uses a simpler, non-blocking pattern:
// Count-based completion tracking
private final SOUnboundedCountDownLatch postAggregationDoneLatch;
// Simple publish loop
for (int i = 0; i < shardCount; i++) {
long cursor = pubSeq.next();
if (cursor < 0) {
// Queue full - do work locally
atom.mergeShard(-1, i);
} else {
queue.get(cursor).of(...);
pubSeq.done(cursor);
queuedCount++;
}
}
// Wait for completion count, not ordered collection
while (!postAggregationDoneLatch.done(queuedCount)) {
// Work steal from queue
}This pattern has no head-of-line blocking because completion is tracked by count, not by queue slot order.
Proposed Solution
Introduce a simplified PageFrameSequence variant (or mode) for factories that don't require ordered collection.
Key changes:
- Replace
collectSubSeq+ collect fan-out withSOUnboundedCountDownLatch - Workers call
latch.countDown()after processing (instead of relying on queue slot advancement) - Owner waits for
latch.done(frameCount)instead of ordered next()/collect() loop - Remove
collectedFrameIndextracking - Remove shards - a single, generously-sized queue would be enough
- Unlike
PageFrameReduceTask, tasks in the new queue don't need to include off-heap row id list: all frame memory pools and resources used by the worker threads should be located in the corresponding atom object
Benefits:
- No head-of-line blocking - any task can complete in any order
- Simpler code path - fewer sequences and fan-outs to manage
- Better throughput when frame processing times vary (different partition sizes, filter selectivity, etc.)
- Reduced latency variance
Applicable factories:
AsyncGroupByRecordCursorFactory- aggregates into maps, order irrelevantAsyncGroupByNotKeyedRecordCursorFactory- single aggregate result, order irrelevantAsyncTopKRecordCursorFactory- reduces into ordered row id lists, order irrelevant- Potentially other parallel factories that don't produce ordered output
Non-applicable factories:
- Any factory that must return results in timestamp/partition order
- Factories with filter + LIMIT that need early termination in order
Implementation Options
- Dual-mode
PageFrameSequence: Add a constructor flag or subclass for ordered vs. unordered collection - Separate class: Create
UnorderedPageFrameSequenceto avoid complexity in the existing class - Refactor to shared base: Extract common dispatch/work-stealing logic, with different collection strategies
Full Name:
Andrei Pechkurov
Affiliation:
QuestDB
Additional context
No response