-
Notifications
You must be signed in to change notification settings - Fork 715
perf: gc stringview before send recordbatch to leader[branch-30] #9526
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Failed to generate code suggestions for PR |
Greptile OverviewGreptile SummaryAdded garbage collection for Key Changes:
How it Works: Confidence Score: 5/5
Important Files ChangedFile Analysis
Sequence DiagramsequenceDiagram
participant Client
participant FlightDataEncoder
participant gc_string_view_batch
participant StringViewBuilder
participant Leader
Client->>FlightDataEncoder: encode_batch(RecordBatch)
FlightDataEncoder->>gc_string_view_batch: gc_string_view_batch(&batch)
loop For each column in RecordBatch
gc_string_view_batch->>gc_string_view_batch: Check if column is StringViewArray
alt Is StringViewArray
gc_string_view_batch->>gc_string_view_batch: Check if data buffers empty
alt Buffers not empty
gc_string_view_batch->>gc_string_view_batch: Calculate ideal_buffer_size
gc_string_view_batch->>gc_string_view_batch: Calculate actual_buffer_size
alt actual_buffer_size > (ideal_buffer_size * 2)
gc_string_view_batch->>StringViewBuilder: Create builder with capacity
gc_string_view_batch->>StringViewBuilder: Set fixed_block_size
loop For each string view
gc_string_view_batch->>StringViewBuilder: append_option(v)
end
StringViewBuilder->>gc_string_view_batch: Return compacted StringViewArray
else Sparse enough
gc_string_view_batch->>gc_string_view_batch: Return original column
end
else Buffers empty
gc_string_view_batch->>gc_string_view_batch: Return original column
end
else Not StringViewArray
gc_string_view_batch->>gc_string_view_batch: Return original column
end
end
gc_string_view_batch->>FlightDataEncoder: Return compacted RecordBatch
FlightDataEncoder->>FlightDataEncoder: split_batch_for_grpc_response()
FlightDataEncoder->>FlightDataEncoder: encode() to FlightData
FlightDataEncoder->>Leader: Send FlightData (reduced memory)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 file reviewed, no comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds a performance optimization that garbage collects StringView arrays before encoding RecordBatches for transmission to the leader node. The implementation is adapted from Apache DataFusion (PR #11587) to address memory inefficiency issues documented in OpenObserve issue #8280.
Key Changes:
- Added
gc_string_view_batchfunction that heuristically compacts StringView arrays when their actual buffer size exceeds 2x the ideal size - Integrated the GC operation into the
encode_batchmethod before splitting batches for gRPC transmission - Added necessary imports for StringView array manipulation (Array, ArrayRef, AsArray, RecordBatchOptions, StringViewBuilder)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
when enable utf8view, for below query, because stringview shared the underly buffer, when send data from follower node to leader node, the arrow-flight will encoding the total buffer again and again, that cost the perf and memory issue.
main branch
this pr