Skip to content

Conversation

@haohuaijin
Copy link
Collaborator

@haohuaijin haohuaijin commented Dec 9, 2025

when enable utf8view, for below query, because stringview shared the underly buffer, when send data from follower node to leader node, the arrow-flight will encoding the total buffer again and again, that cost the perf and memory issue.

SELECT trace_id, min(_timestamp) as t FROM default group by trace_id order by t desc

main branch

image

this pr

image

Copilot AI review requested due to automatic review settings December 9, 2025 08:16
@haohuaijin haohuaijin added the Needs-Testing Needs-Testing label Dec 9, 2025
@haohuaijin haohuaijin added this to the v0.30.0 milestone Dec 9, 2025
@github-actions
Copy link
Contributor

github-actions bot commented Dec 9, 2025

Failed to generate code suggestions for PR

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Dec 9, 2025

Greptile Overview

Greptile Summary

Added garbage collection for StringViewArray columns before encoding record batches for gRPC transport to reduce memory usage and improve performance when sending data to the leader node.

Key Changes:

  • Imported necessary Arrow array types (Array, ArrayRef, AsArray, RecordBatchOptions, StringViewBuilder)
  • Added gc_string_view_batch() function that compacts sparse StringViewArray columns using a heuristic (compacts when actual buffer size > 2x ideal size)
  • Integrated garbage collection into encode_batch() method before splitting and encoding

How it Works:
The implementation follows the DataFusion approach (referenced in apache/datafusion#11587) to address memory bloat in StringViewArray after operations like filtering. The function iterates through all columns, identifies StringViewArray columns, calculates if they're holding excessive unused buffer space, and rebuilds them with a single consolidated buffer when needed. This reduces memory overhead during network transmission while preserving data integrity.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation is a well-documented performance optimization that follows an established pattern from Apache DataFusion. The code includes comprehensive documentation explaining the heuristic, handles edge cases (empty buffers, non-StringView columns), and doesn't modify existing test coverage. The change is isolated to the encoding path and preserves data integrity through proper RecordBatch reconstruction.
  • No files require special attention

Important Files Changed

File Analysis

Filename Score Overview
src/flight/src/encoder.rs 5/5 Added string view garbage collection before encoding to reduce memory usage when sending record batches to leader

Sequence Diagram

sequenceDiagram
    participant Client
    participant FlightDataEncoder
    participant gc_string_view_batch
    participant StringViewBuilder
    participant Leader

    Client->>FlightDataEncoder: encode_batch(RecordBatch)
    FlightDataEncoder->>gc_string_view_batch: gc_string_view_batch(&batch)
    
    loop For each column in RecordBatch
        gc_string_view_batch->>gc_string_view_batch: Check if column is StringViewArray
        alt Is StringViewArray
            gc_string_view_batch->>gc_string_view_batch: Check if data buffers empty
            alt Buffers not empty
                gc_string_view_batch->>gc_string_view_batch: Calculate ideal_buffer_size
                gc_string_view_batch->>gc_string_view_batch: Calculate actual_buffer_size
                alt actual_buffer_size > (ideal_buffer_size * 2)
                    gc_string_view_batch->>StringViewBuilder: Create builder with capacity
                    gc_string_view_batch->>StringViewBuilder: Set fixed_block_size
                    loop For each string view
                        gc_string_view_batch->>StringViewBuilder: append_option(v)
                    end
                    StringViewBuilder->>gc_string_view_batch: Return compacted StringViewArray
                else Sparse enough
                    gc_string_view_batch->>gc_string_view_batch: Return original column
                end
            else Buffers empty
                gc_string_view_batch->>gc_string_view_batch: Return original column
            end
        else Not StringViewArray
            gc_string_view_batch->>gc_string_view_batch: Return original column
        end
    end
    
    gc_string_view_batch->>FlightDataEncoder: Return compacted RecordBatch
    FlightDataEncoder->>FlightDataEncoder: split_batch_for_grpc_response()
    FlightDataEncoder->>FlightDataEncoder: encode() to FlightData
    FlightDataEncoder->>Leader: Send FlightData (reduced memory)
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, no comments

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a performance optimization that garbage collects StringView arrays before encoding RecordBatches for transmission to the leader node. The implementation is adapted from Apache DataFusion (PR #11587) to address memory inefficiency issues documented in OpenObserve issue #8280.

Key Changes:

  • Added gc_string_view_batch function that heuristically compacts StringView arrays when their actual buffer size exceeds 2x the ideal size
  • Integrated the GC operation into the encode_batch method before splitting batches for gRPC transmission
  • Added necessary imports for StringView array manipulation (Array, ArrayRef, AsArray, RecordBatchOptions, StringViewBuilder)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@haohuaijin haohuaijin merged commit 8021a0a into branch-v0.30.0 Dec 9, 2025
45 checks passed
@haohuaijin haohuaijin deleted the stringview-gc-30 branch December 9, 2025 13:16
@haohuaijin haohuaijin added Testing-Completed Testing-Completed and removed Needs-Testing Needs-Testing labels Dec 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Testing-Completed Testing-Completed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants