Alternative `PageFrameSequence` implementation to get rid of Head-of-Line Blocking

## Problem: Head-of-Line Blocking in PageFrameSequence                                                                                                                                      
                                                                                                                                                                                              
  ### Background                                                                                                                                                                              
                                                                                                                                                                                              
  `PageFrameSequence` is the central coordinator for parallel query execution in QuestDB. It dispatches page frame tasks to worker threads via a ring queue and collects results as workers complete them.                                                                                                                                                                              
                                                                                                                                                                                              
  The current implementation uses an **ordered collection model**: tasks are dispatched to sequential ring queue slots, and the `collectSubSeq` (Single Consumer Sequence) advances through slots in FIFO order. A task at slot N can only be collected after slot N-1 has been collected.                                                                                              
                                                                                                                                                                                              
  ### The Problem                                                                                                                                                                             
                                                                                                                                                                                              
  This design causes **head-of-line blocking** when workers have variable completion times.                                                                                                   
                                                                                                                                                                                              
  **Scenario:**                                                                                                                                                                               
  1. Frames 0, 1, 2, 3, 4 are dispatched to queue slots 100, 101, 102, 103, 104                                                                                                               
  2. Worker A processes frame 0 (slot 100), Workers B-E process frames 1-4                                                                                                                    
  3. Worker A is slow (e.g., frame 0 has more rows, complex filter, or cold storage access)                                                                                                   
  4. Workers B-E complete frames 1-4 quickly                                                                                                                                                  
                                                                                                                                                                                              
  **Result:**                                                                                                                                                                                 
  - `collectSubSeq` is blocked at slot 100 waiting for Worker A                                                                                                                               
  - Completed frames 1-4 sit in the queue, unable to be collected                                                                                                                             
  - The queue fills up, blocking dispatch of new frames                                                                                                                                       
  - Work stealing can process other queued tasks but cannot help with the slow task already owned by Worker A                                                                                 
  - Pipeline throughput degrades to single-worker speed until the slow task completes                                                                                                         
                                                                                                                                                                                              
  ### Downsides                                                                                                                                                                               
                                                                                                                                                                                              
  1. **Reduced parallelism**: Completed work waits idle while a slow task blocks the pipeline                                                                                                 
  2. **Queue pressure**: Full queue forces fallback to local execution (`workLocally()`), losing parallelism for remaining frames                                                             
  3. **Latency spikes**: A single slow partition/frame can significantly increase total query time                                                                                            
  4. **Wasted resources**: Worker threads that finished early sit idle waiting for collection to unblock                                                                                      
  5. **Unnecessary complexity**: Factories that don't need ordered results (GROUP BY, top K, etc.) pay the cost of ordered collection infrastructure                                          
                                                                                                                                                                                              
  ### Observation                                                                                                                                                                             
                                                                                                                                                                                              
  The `mergeShards()` method in `AsyncGroupByRecordCursor` already uses a simpler, non-blocking pattern:                                                                                      
                                                                                                                                                                                              
  ```java
  // Count-based completion tracking
  private final SOUnboundedCountDownLatch postAggregationDoneLatch;

  // Simple publish loop
  for (int i = 0; i < shardCount; i++) {
      long cursor = pubSeq.next();
      if (cursor < 0) {
          // Queue full - do work locally
          atom.mergeShard(-1, i);
      } else {
          queue.get(cursor).of(...);
          pubSeq.done(cursor);
          queuedCount++;
      }
  }

  // Wait for completion count, not ordered collection
  while (!postAggregationDoneLatch.done(queuedCount)) {
      // Work steal from queue
  }
```

  This pattern has no head-of-line blocking because completion is tracked by count, not by queue slot order.                                                                                  
                                                                                                                                                                                              
  ### Proposed Solution                                                                                                                                                                           
                                                                                                                                                                                              
  Introduce a simplified `PageFrameSequence` variant (or mode) for factories that don't require ordered collection.                                                                             
                                                                                                                                                                                              
  Key changes:                                                                                                                                                                                
  - Replace `collectSubSeq` + collect fan-out with `SOUnboundedCountDownLatch`
  - Workers call `latch.countDown()` after processing (instead of relying on queue slot advancement)                                                                                            
  - Owner waits for `latch.done(frameCount)` instead of ordered next()/collect() loop                                                                                                           
  - Remove `collectedFrameIndex` tracking                                                                                                                                                       
  - Remove shards - a single, generously-sized queue would be enough
  - Unlike `PageFrameReduceTask`, tasks in the new queue don't need to include off-heap row id list: all frame memory pools and resources used by the worker threads should be located in the corresponding atom object
                                                                                                                                                                                              
  Benefits:                                                                                                                                                                                   
  - No head-of-line blocking - any task can complete in any order                                                                                                                             
  - Simpler code path - fewer sequences and fan-outs to manage                                                                                                                                
  - Better throughput when frame processing times vary (different partition sizes, filter selectivity, etc.)                                                                                  
  - Reduced latency variance                                                                                                                                                                  
                                                                                                                                                                                              
  Applicable factories:                                                                                                                                                                       
  - `AsyncGroupByRecordCursorFactory` - aggregates into maps, order irrelevant
  - `AsyncGroupByNotKeyedRecordCursorFactory` - single aggregate result, order irrelevant
  - `AsyncTopKRecordCursorFactory` - reduces into ordered row id lists, order irrelevant
  - Potentially other parallel factories that don't produce ordered output                                                                                                                    
                                                                                                                                                                                              
  Non-applicable factories:                                                                                                                                                                   
  - Any factory that must return results in timestamp/partition order                                                                                                                         
  - Factories with filter + LIMIT that need early termination in order                                                                                                                                 
                                                                                                                                                                                              
  ### Implementation Options
                                                                                                                                                                                              
  1. Dual-mode `PageFrameSequence`: Add a constructor flag or subclass for ordered vs. unordered collection                                                                                     
  2. Separate class: Create `UnorderedPageFrameSequence` to avoid complexity in the existing class                                                                                              
  3. Refactor to shared base: Extract common dispatch/work-stealing logic, with different collection strategies

### Full Name:

Andrei Pechkurov

### Affiliation:

QuestDB

### Additional context

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Alternative `PageFrameSequence` implementation to get rid of Head-of-Line Blocking #6655

Problem: Head-of-Line Blocking in PageFrameSequence

Background

The Problem

Downsides

Observation

Proposed Solution

Implementation Options

Full Name:

Affiliation:

Additional context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Alternative PageFrameSequence implementation to get rid of Head-of-Line Blocking #6655

Description

Problem: Head-of-Line Blocking in PageFrameSequence

Background

The Problem

Downsides

Observation

Proposed Solution

Implementation Options

Full Name:

Affiliation:

Additional context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Alternative `PageFrameSequence` implementation to get rid of Head-of-Line Blocking #6655