Skip to content

Alternative PageFrameSequence implementation to get rid of Head-of-Line Blocking #6655

@puzpuzpuz

Description

@puzpuzpuz

Problem: Head-of-Line Blocking in PageFrameSequence

Background

PageFrameSequence is the central coordinator for parallel query execution in QuestDB. It dispatches page frame tasks to worker threads via a ring queue and collects results as workers complete them.

The current implementation uses an ordered collection model: tasks are dispatched to sequential ring queue slots, and the collectSubSeq (Single Consumer Sequence) advances through slots in FIFO order. A task at slot N can only be collected after slot N-1 has been collected.

The Problem

This design causes head-of-line blocking when workers have variable completion times.

Scenario:

  1. Frames 0, 1, 2, 3, 4 are dispatched to queue slots 100, 101, 102, 103, 104
  2. Worker A processes frame 0 (slot 100), Workers B-E process frames 1-4
  3. Worker A is slow (e.g., frame 0 has more rows, complex filter, or cold storage access)
  4. Workers B-E complete frames 1-4 quickly

Result:

  • collectSubSeq is blocked at slot 100 waiting for Worker A
  • Completed frames 1-4 sit in the queue, unable to be collected
  • The queue fills up, blocking dispatch of new frames
  • Work stealing can process other queued tasks but cannot help with the slow task already owned by Worker A
  • Pipeline throughput degrades to single-worker speed until the slow task completes

Downsides

  1. Reduced parallelism: Completed work waits idle while a slow task blocks the pipeline
  2. Queue pressure: Full queue forces fallback to local execution (workLocally()), losing parallelism for remaining frames
  3. Latency spikes: A single slow partition/frame can significantly increase total query time
  4. Wasted resources: Worker threads that finished early sit idle waiting for collection to unblock
  5. Unnecessary complexity: Factories that don't need ordered results (GROUP BY, top K, etc.) pay the cost of ordered collection infrastructure

Observation

The mergeShards() method in AsyncGroupByRecordCursor already uses a simpler, non-blocking pattern:

// Count-based completion tracking
private final SOUnboundedCountDownLatch postAggregationDoneLatch;

// Simple publish loop
for (int i = 0; i < shardCount; i++) {
    long cursor = pubSeq.next();
    if (cursor < 0) {
        // Queue full - do work locally
        atom.mergeShard(-1, i);
    } else {
        queue.get(cursor).of(...);
        pubSeq.done(cursor);
        queuedCount++;
    }
}

// Wait for completion count, not ordered collection
while (!postAggregationDoneLatch.done(queuedCount)) {
    // Work steal from queue
}

This pattern has no head-of-line blocking because completion is tracked by count, not by queue slot order.

Proposed Solution

Introduce a simplified PageFrameSequence variant (or mode) for factories that don't require ordered collection.

Key changes:

  • Replace collectSubSeq + collect fan-out with SOUnboundedCountDownLatch
  • Workers call latch.countDown() after processing (instead of relying on queue slot advancement)
  • Owner waits for latch.done(frameCount) instead of ordered next()/collect() loop
  • Remove collectedFrameIndex tracking
  • Remove shards - a single, generously-sized queue would be enough
  • Unlike PageFrameReduceTask, tasks in the new queue don't need to include off-heap row id list: all frame memory pools and resources used by the worker threads should be located in the corresponding atom object

Benefits:

  • No head-of-line blocking - any task can complete in any order
  • Simpler code path - fewer sequences and fan-outs to manage
  • Better throughput when frame processing times vary (different partition sizes, filter selectivity, etc.)
  • Reduced latency variance

Applicable factories:

  • AsyncGroupByRecordCursorFactory - aggregates into maps, order irrelevant
  • AsyncGroupByNotKeyedRecordCursorFactory - single aggregate result, order irrelevant
  • AsyncTopKRecordCursorFactory - reduces into ordered row id lists, order irrelevant
  • Potentially other parallel factories that don't produce ordered output

Non-applicable factories:

  • Any factory that must return results in timestamp/partition order
  • Factories with filter + LIMIT that need early termination in order

Implementation Options

  1. Dual-mode PageFrameSequence: Add a constructor flag or subclass for ordered vs. unordered collection
  2. Separate class: Create UnorderedPageFrameSequence to avoid complexity in the existing class
  3. Refactor to shared base: Extract common dispatch/work-stealing logic, with different collection strategies

Full Name:

Andrei Pechkurov

Affiliation:

QuestDB

Additional context

No response

Metadata

Metadata

Assignees

Labels

EnhancementEnhance existing functionalityPerformancePerformance improvementsSQLIssues or changes relating to SQL execution

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions