Skip to content

Preserve RoundTrip types in RowConverter even if preserve_dictionaries=false #4813

@alamb

Description

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We struggle with the memory used by the RowConverter when interning values from DictionaryArrays. We are even proposing a special CardinalityAware wrapper on top of the RowConverter in DataFusion (see apache/datafusion#7401)

At the moment, round tripping data from Array to Rows and then back to Array works like this:

DictionaryArray -- (preserve_dictionaries = false) --> Rows --> Primtive/StringArray

In DataFusion we must maintain the same input / output types, so in our proposed improvement we needed to add a call to cast, which @tustvold notes is likely very expensive: https://github.com/apache/arrow-datafusion/pull/7401/files#r1324281222

Describe the solution you'd like
I would like the RowConverter to produce the same output type as the input type on SortField, even if preserve_dictionaries is set to false

This would avoid a copy of the String data and likely perform much better.

Describe alternatives you've considered
We could potentially simply remove stateful row encoding: #4811

Additional context

Metadata

Metadata

Assignees

No one assigned

    Labels

    arrowChanges to the arrow crateenhancementAny new improvement worthy of a entry in the changelog

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions