Add a CardinalityAwareRowConverter #4736
JayjeetAtGithub wants to merge 7 commits into apache:master from …
Conversation
Currently, I hardcode the dictionary key type to be …
@JayjeetAtGithub -- I think after some thought we should put this code into DataFusion (it will need to be changed). The reason I think this belongs in DataFusion is that the arrow row converter already has a way to control if interning is used for dictionaries. It seems somewhat confusing to then also have something that automatically picks an interning strategy based on cardinality -- I think until we see others wanting to use this, let's leave this in DataFusion.
Dictionaries can only be one of https://docs.rs/arrow/latest/arrow/datatypes/trait.ArrowDictionaryKeyType.html (so Int/UInt is the right list of types)
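For context, a minimal sketch of the existing per-field interning control being referred to here, assuming the `SortField::preserve_dictionaries` toggle that arrow-row exposed at the time (the example data is made up):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;
use arrow::row::{RowConverter, SortField};

fn main() -> Result<(), ArrowError> {
    // A dictionary-encoded string column; keys must be one of the integer
    // types listed under ArrowDictionaryKeyType.
    let dict_type =
        DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
    let strings: ArrayRef = Arc::new(StringArray::from(vec!["a", "b", "a"]));
    let dict = cast(&strings, &dict_type)?;

    // Interning is opted in or out per field when the converter is built,
    // so the choice is fixed for the converter's lifetime, not per batch.
    let mut converter = RowConverter::new(vec![
        SortField::new(dict_type).preserve_dictionaries(false),
    ])?;
    let rows = converter.convert_columns(&[dict])?;
    assert!(rows.row(0) < rows.row(1)); // "a" < "b"
    assert!(rows.row(0) == rows.row(2)); // equal values encode identically
    Ok(())
}
```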
alamb left a comment
Thanks again @JayjeetAtGithub -- as I commented before, I suggest we start with this code in the datafusion repo, and then we can move it upstream in the future if it turns out to be more commonly useful
Let's move the conversation to apache/datafusion#7401
    _ => unreachable!(),
};

if cardinality >= LOW_CARDINALITY_THRESHOLD {
this will effectively switch the encoding mid-stream, I think -- which will mean that the output can't be compared with previously created rows, which is not correct.
I think the decision has to be made based on the first batch and then that decision used for encoding all rows
@alamb I am sorry, I didn't quite get it. I thought what I was doing was tapping into the first batch of the stream, looking into it, setting the right codec (whether to use the interner or not) to encode the batches, and then letting the conversion proceed for all the batches (including the first one).
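To make the "decide once, on the first batch" flow concrete, here is a rough sketch of the approach under discussion. The wrapper body is a reconstruction rather than the PR's actual code: `SortField::preserve_dictionaries` is assumed to be the interning toggle arrow-row had at the time, and `LOW_CARDINALITY_THRESHOLD` follows the PR description (10):

```rust
use arrow::array::{Array, ArrayRef};
use arrow::datatypes::DataType;
use arrow::downcast_dictionary_array;
use arrow::error::ArrowError;
use arrow::row::{RowConverter, Rows, SortField};

const LOW_CARDINALITY_THRESHOLD: usize = 10;

/// Builds the inner RowConverter lazily from the first batch only, so the
/// interning decision is made exactly once; every later batch is encoded
/// with the same codec and rows stay comparable across batches.
struct CardinalityAwareRowConverter {
    fields: Vec<SortField>,
    inner: Option<RowConverter>,
}

impl CardinalityAwareRowConverter {
    fn new(fields: Vec<SortField>) -> Self {
        Self { fields, inner: None }
    }

    fn convert_columns(&mut self, columns: &[ArrayRef]) -> Result<Rows, ArrowError> {
        if self.inner.is_none() {
            // First batch: disable dictionary preservation for any
            // high-cardinality dictionary column.
            let fields: Vec<SortField> = self
                .fields
                .iter()
                .zip(columns)
                .map(|(field, col)| match col.data_type() {
                    DataType::Dictionary(_, _) => {
                        // Length of the values array as the cardinality
                        // measure (see the discussion below).
                        let cardinality = downcast_dictionary_array!(
                            col => col.values().len(),
                            _ => unreachable!(),
                        );
                        if cardinality >= LOW_CARDINALITY_THRESHOLD {
                            field.clone().preserve_dictionaries(false)
                        } else {
                            field.clone()
                        }
                    }
                    _ => field.clone(),
                })
                .collect();
            self.inner = Some(RowConverter::new(fields)?);
        }
        // Every batch, including the first, goes through the same converter.
        self.inner.as_mut().unwrap().convert_columns(columns)
    }
}
```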
for (i, col) in columns.iter().enumerate() {
    if let DataType::Dictionary(k, _) = col.data_type() {
        // let cardinality = col.as_any().downcast_ref::<DictionaryArray<Int32Type>>().unwrap().values().len();
        let cardinality = match k.as_ref() {
I originally thought that we should base the decision of "is this a high cardinality column" on the "in use" cardinality of the dictionary (aka how many distinct key values there were -- as suggested on apache/datafusion#7200 (comment))
However, I now realize that maybe the number of potential key values (aka the length of the values array) is actually a more robust predictor of being "high cardinality" (as the other values in the dictionary could be used in subsequent batches, perhaps)
Do you have any opinion @tustvold?
Given the RowConverter blindly generates a mapping for all values, regardless of whether they appear in the keys, I think we should just use the length of the values. Whilst an argument could be made for doing something more sophisticated, this would only really make sense if the dictionary interner itself followed a similar approach.
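For illustration, a sketch of the two candidate measures side by side (both helper names are made up; the values-length variant is what the thread settles on):

```rust
use std::collections::HashSet;

use arrow::array::{Array, ArrayRef};
use arrow::downcast_dictionary_array;

/// Upper bound: the length of the values array, whether or not every value
/// is actually referenced by a key. Matches how the row interner behaves,
/// since it maps all values regardless.
fn values_len(col: &ArrayRef) -> usize {
    downcast_dictionary_array!(
        col => col.values().len(),
        _ => unreachable!("expected a dictionary column"),
    )
}

/// "In use" cardinality: distinct keys appearing in this batch. More precise
/// for the batch at hand, but later batches may reference values that this
/// batch does not.
fn distinct_keys(col: &ArrayRef) -> usize {
    downcast_dictionary_array!(
        col => col.keys().iter().flatten().collect::<HashSet<_>>().len(),
        _ => unreachable!("expected a dictionary column"),
    )
}
```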
Closing this PR in favor of apache/datafusion#7401. Shall continue the discussion there.
Which issue does this PR close?
This PR adds a CardinalityAwareRowConverter (a wrapper around RowConverter) to arrow-row. Basically, when the cardinality of dict-encoded sort fields is >= 10, we no longer preserve dictionary encoding and fall back to string encoding.
Closes apache/datafusion#7200.
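For completeness, a hypothetical usage sketch (not code from this PR) that reuses the CardinalityAwareRowConverter outline from the review thread above:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;
use arrow::row::SortField;

fn main() -> Result<(), ArrowError> {
    let dict_type =
        DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
    // 20 distinct values is >= the threshold of 10, so the wrapper should
    // fall back to string encoding for this column.
    let strings: ArrayRef = Arc::new(StringArray::from_iter_values(
        (0..20).map(|i| format!("v{i:02}")),
    ));
    let dict = cast(&strings, &dict_type)?;

    // CardinalityAwareRowConverter as sketched earlier in the thread.
    let mut converter =
        CardinalityAwareRowConverter::new(vec![SortField::new(dict_type)]);
    // The first call fixes the codec; later batches reuse it, keeping all
    // emitted rows mutually comparable.
    let rows = converter.convert_columns(&[dict])?;
    assert!(rows.row(0) < rows.row(1)); // "v00" < "v01"
    Ok(())
}
```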
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?