Add a CardinalityAwareRowConverter #4736
JayjeetAtGithub wants to merge 7 commits into apache:master from …
Conversation
Currently, I hardcode the dictionary key type to be …
@JayjeetAtGithub -- I think after some thought we should put this code into DataFusion (it will need to be changed). The reason I think this belongs in DataFusion is that the arrow row converter already has a way to control if interning is used for dictionaries. It seems somewhat confusing to then also have something that automatically picks an interning strategy based on cardinality -- I think until we see others wanting to use this, let's leave this in DataFusion.
Dictionaries can only be one of https://docs.rs/arrow/latest/arrow/datatypes/trait.ArrowDictionaryKeyType.html (so Int/UInt is the right list of types)
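For context, a minimal sketch of the existing per-field interning control being referred to here, assuming the `SortField::preserve_dictionaries` toggle that arrow-row exposed at the time (the example data is made up):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;
use arrow::row::{RowConverter, SortField};

fn main() -> Result<(), ArrowError> {
    // A dictionary-encoded string column; keys must be one of the integer
    // types listed under ArrowDictionaryKeyType.
    let dict_type =
        DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
    let strings: ArrayRef = Arc::new(StringArray::from(vec!["a", "b", "a"]));
    let dict = cast(&strings, &dict_type)?;

    // Interning is opted in or out per field when the converter is built,
    // so the choice is fixed for the converter's lifetime, not per batch.
    let mut converter = RowConverter::new(vec![
        SortField::new(dict_type).preserve_dictionaries(false),
    ])?;
    let rows = converter.convert_columns(&[dict])?;
    assert!(rows.row(0) < rows.row(1)); // "a" < "b"
    assert!(rows.row(0) == rows.row(2)); // equal values encode identically
    Ok(())
}
```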
alamb left a comment
Thanks again @JayjeetAtGithub -- as I commented before, I suggest we start with this code in the datafusion repo, and then we can move it upstream in the future if it turns out to be more commonly useful
Let's move the conversation to apache/datafusion#7401
    _ => unreachable!(),
};

if cardinality >= LOW_CARDINALITY_THRESHOLD {
this will effectively switch the encoding mid-stream, I think -- which will mean that the output can't be compared with previously created rows, which is not correct.
I think the decision has to be made based on the first batch and then that decision used for encoding all rows
@alamb I am sorry, I didn't quite get it. I thought what I was doing was tapping into the first batch of the stream, looking into it, setting the right codec (whether to use the interner or not) to encode the batches, and then letting the conversion proceed for all the batches (including the first one).
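To make the "decide once, on the first batch" flow concrete, here is a rough sketch of the approach under discussion. The wrapper body is a reconstruction rather than the PR's actual code: `SortField::preserve_dictionaries` is assumed to be the interning toggle arrow-row had at the time, and `LOW_CARDINALITY_THRESHOLD` follows the PR description (10):

```rust
use arrow::array::{Array, ArrayRef};
use arrow::datatypes::DataType;
use arrow::downcast_dictionary_array;
use arrow::error::ArrowError;
use arrow::row::{RowConverter, Rows, SortField};

const LOW_CARDINALITY_THRESHOLD: usize = 10;

/// Builds the inner RowConverter lazily from the first batch only, so the
/// interning decision is made exactly once; every later batch is encoded
/// with the same codec and rows stay comparable across batches.
struct CardinalityAwareRowConverter {
    fields: Vec<SortField>,
    inner: Option<RowConverter>,
}

impl CardinalityAwareRowConverter {
    fn new(fields: Vec<SortField>) -> Self {
        Self { fields, inner: None }
    }

    fn convert_columns(&mut self, columns: &[ArrayRef]) -> Result<Rows, ArrowError> {
        if self.inner.is_none() {
            // First batch: disable dictionary preservation for any
            // high-cardinality dictionary column.
            let fields: Vec<SortField> = self
                .fields
                .iter()
                .zip(columns)
                .map(|(field, col)| match col.data_type() {
                    DataType::Dictionary(_, _) => {
                        // Length of the values array as the cardinality
                        // measure (see the discussion below).
                        let cardinality = downcast_dictionary_array!(
                            col => col.values().len(),
                            _ => unreachable!(),
                        );
                        if cardinality >= LOW_CARDINALITY_THRESHOLD {
                            field.clone().preserve_dictionaries(false)
                        } else {
                            field.clone()
                        }
                    }
                    _ => field.clone(),
                })
                .collect();
            self.inner = Some(RowConverter::new(fields)?);
        }
        // Every batch, including the first, goes through the same converter.
        self.inner.as_mut().unwrap().convert_columns(columns)
    }
}
```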
for (i, col) in columns.iter().enumerate() {
    if let DataType::Dictionary(k, _) = col.data_type() {
        // let cardinality = col.as_any().downcast_ref::<DictionaryArray<Int32Type>>().unwrap().values().len();
        let cardinality = match k.as_ref() {
I originally thought that we should base the decision of "is this a high cardinality column" on the "in use" cardinality of the dictionary (aka how many distinct key values there were -- as suggested on apache/datafusion#7200 (comment))
However, I now realize that maybe the number of potential key values (aka the length of the values array) is actually a more robust predictor of being "high cardinality" (as the other values in the dictionary could be used in subsequent batches, perhaps)
Do you have any opinion @tustvold?
Given the RowConverter blindly generates a mapping for all values, regardless of whether they appear in the keys, I think we should just use the length of the values. Whilst an argument could be made for doing something more sophisticated, this would only really make sense if the dictionary interner itself followed a similar approach.
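For illustration, a sketch of the two candidate measures side by side (both helper names are made up; the values-length variant is what the thread settles on):

```rust
use std::collections::HashSet;

use arrow::array::{Array, ArrayRef};
use arrow::downcast_dictionary_array;

/// Upper bound: the length of the values array, whether or not every value
/// is actually referenced by a key. Matches how the row interner behaves,
/// since it maps all values regardless.
fn values_len(col: &ArrayRef) -> usize {
    downcast_dictionary_array!(
        col => col.values().len(),
        _ => unreachable!("expected a dictionary column"),
    )
}

/// "In use" cardinality: distinct keys appearing in this batch. More precise
/// for the batch at hand, but later batches may reference values that this
/// batch does not.
fn distinct_keys(col: &ArrayRef) -> usize {
    downcast_dictionary_array!(
        col => col.keys().iter().flatten().collect::<HashSet<_>>().len(),
        _ => unreachable!("expected a dictionary column"),
    )
}
```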
Closing this PR in favor of apache/datafusion#7401. Shall continue the discussion there.
Which issue does this PR close?
This PR adds a CardinalityAwareRowConverter (a wrapper around RowConverter) to arrow-row. Basically, when the cardinality of dict-encoded sort fields is >= 10, we no longer preserve dictionary encoding and fall back to string encoding.
Closes apache/datafusion#7200.
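For completeness, a hypothetical usage sketch (not code from this PR) that reuses the CardinalityAwareRowConverter outline from the review thread above:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;
use arrow::row::SortField;

fn main() -> Result<(), ArrowError> {
    let dict_type =
        DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
    // 20 distinct values is >= the threshold of 10, so the wrapper should
    // fall back to string encoding for this column.
    let strings: ArrayRef = Arc::new(StringArray::from_iter_values(
        (0..20).map(|i| format!("v{i:02}")),
    ));
    let dict = cast(&strings, &dict_type)?;

    // CardinalityAwareRowConverter as sketched earlier in the thread.
    let mut converter =
        CardinalityAwareRowConverter::new(vec![SortField::new(dict_type)]);
    // The first call fixes the codec; later batches reuse it, keeping all
    // emitted rows mutually comparable.
    let rows = converter.convert_columns(&[dict])?;
    assert!(rows.row(0) < rows.row(1)); // "v00" < "v01"
    Ok(())
}
```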
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?