Open
Labels
enhancement (any new improvement worthy of an entry in the changelog)
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This ticket tracks improvements to the BatchCoalescer for the use case described in #6692.
That is
┌────────────────────┐       Filter       ┌────────────────────┐                  Coalesce Batches
│    RecordBatch     │                    │    RecordBatch     │
│  num_rows = 8000   │   ─ ─ ─ ─ ─ ─ ▶    │   num_rows = 234   │─ ─ ─ ┐
└────────────────────┘                    └────────────────────┘      │
                                                                      │         ┌────────────────────┐
┌────────────────────┐       Filter       ┌────────────────────┐      │         │                    │
│    RecordBatch     │                    │    RecordBatch     │      │         │    RecordBatch     │
│  num_rows = 8000   │   ─ ─ ─ ─ ─ ─ ▶    │   num_rows = 500   │─ ─ ─ ┼ ─ ─ ─ ▶ │  num_rows = 8000   │
└────────────────────┘                    └────────────────────┘      │         │                    │
         ...                                       ...                │         └────────────────────┘
┌────────────────────┐       Filter       ┌────────────────────┐      │
│    RecordBatch     │                    │    RecordBatch     │      │
│  num_rows = 8000   │   ─ ─ ─ ─ ─ ─ ▶    │   num_rows = 333   │─ ─ ─ ┘
└────────────────────┘                    └────────────────────┘

FilterExec creates output batches with copies of the matching rows (calls take() to make a copy).
RepartitionExec copies the data *again* to form final large RecordBatches.
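To make the double copy concrete, here is a minimal sketch in plain Rust that uses only the standard library. `SketchCoalescer`, `push_with_filter`, and `filter_then_concat` are hypothetical names invented for illustration; they are not the arrow-rs `BatchCoalescer` API, and a single `Vec<i32>` column stands in for a `RecordBatch`. The baseline copies each matching row twice (once when filtering, once when concatenating), while the sketch writes each matching row directly into the in-progress output buffer.

```rust
/// Hypothetical stand-in for a coalescer over one i32 column.
/// Illustration only: this is not the arrow-rs `BatchCoalescer` API.
struct SketchCoalescer {
    target_rows: usize,
    in_progress: Vec<i32>,
    completed: Vec<Vec<i32>>,
}

impl SketchCoalescer {
    fn new(target_rows: usize) -> Self {
        assert!(target_rows > 0, "target_rows must be non-zero");
        Self {
            target_rows,
            in_progress: Vec::with_capacity(target_rows),
            completed: Vec::new(),
        }
    }

    /// Copies each kept row once, straight into the in-progress output buffer.
    fn push_with_filter(&mut self, batch: &[i32], keep: &[bool]) {
        assert_eq!(batch.len(), keep.len());
        for (value, keep_row) in batch.iter().zip(keep) {
            if *keep_row {
                self.in_progress.push(*value);
                if self.in_progress.len() == self.target_rows {
                    // Emit a full output batch and start a new buffer.
                    let next = Vec::with_capacity(self.target_rows);
                    let full = std::mem::replace(&mut self.in_progress, next);
                    self.completed.push(full);
                }
            }
        }
    }

    /// Flushes any remaining buffered rows as a final, smaller batch.
    fn finish(&mut self) {
        if !self.in_progress.is_empty() {
            self.completed.push(std::mem::take(&mut self.in_progress));
        }
    }
}

/// Two-copy baseline: filter each input into a small batch, then concatenate.
fn filter_then_concat(batches: &[Vec<i32>], pred: impl Fn(i32) -> bool) -> Vec<i32> {
    // First copy: matching rows go into one small Vec per input batch.
    let filtered: Vec<Vec<i32>> = batches
        .iter()
        .map(|b| b.iter().copied().filter(|v| pred(*v)).collect())
        .collect();
    // Second copy: the small batches are concatenated into the output.
    filtered.concat()
}

fn main() {
    // Three input batches of 8000 rows each, as in the diagram.
    let batches: Vec<Vec<i32>> = (0..3)
        .map(|i| (i * 8000..(i + 1) * 8000).collect())
        .collect();
    let pred = |v: i32| v % 20 == 0;

    let two_copy = filter_then_concat(&batches, pred);

    let mut coalescer = SketchCoalescer::new(500);
    for batch in &batches {
        let keep: Vec<bool> = batch.iter().map(|v| pred(*v)).collect();
        coalescer.push_with_filter(batch, &keep);
    }
    coalescer.finish();

    // Both paths produce the same rows in the same order.
    assert_eq!(coalescer.completed.concat(), two_copy);
    println!("output batches: {}", coalescer.completed.len());
}
```

The sketch ignores everything that makes the real problem hard (multiple columns, variable-length and view types, null buffers), which is what the specialized follow-on items are about.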
This ticket tracks additional follow-on work:
- [coalesce] Implement specialized `BatchCoalescer::push_batch` for `PrimitiveArray` #7763
- [coalesce] Implement specialized `BatchCoalescer::push_batch` for `StringArray` #7764
- [coalesce] Implement specialized `BatchCoalescer::push_batch_with_filter` for primitive array #7762
- [coalesce] Implement specialized `BatchCoalescer::push_batch_with_filter` for binaryview array #9143
- [coalesce] Special case `BatchCoalescer`/`GenericInProgressArray` when multiple batches are pushed in with the same buffer #7765
- Avoid copies for `push_batch_with_filter` for primitive types #9136
- `BatchCoalescer` but without automatic batching #8850
Additional context
- The use case is described in detail in "Optimize take/filter/concat from multiple input arrays to a single large output array" (#6692).