Skip to content

Fix RowConverter roundtrip for List of Dictionary#8067

Closed
LiaCastaneda wants to merge 2 commits intoapache:mainfrom
LiaCastaneda:lia/fix-row-converter
Closed

Fix RowConverter roundtrip for List of Dictionary#8067
LiaCastaneda wants to merge 2 commits intoapache:mainfrom
LiaCastaneda:lia/fix-row-converter

Conversation

@LiaCastaneda
Copy link

@LiaCastaneda LiaCastaneda commented Aug 6, 2025

Which issue does this PR close?

Closes apache/datafusion#17012

Rationale for this change

The rountrip List<Dictionary<(),()>> -> Row -> List<Dictionary<(),()>> is failing:

Expected infallible creation of GenericListArray from ArrayDataRef failed: InvalidArgumentError("[Large]ListArray's child datatype Utf8 does not correspond to the List's datatype Dictionary(Int8, Utf8)")

What changes are included in this PR?

My approach is to encode back to Dictionary when we convert back the rows to Arrays, a straightforward way to do is is ussing arrow_cast::cast that under the hood builds a dictionary from an Array.

Are these changes tested?

Yes, I added a roundtrip test that verifies that the original Array is the same as the resulting one after the roundtrip

Are there any user-facing changes?

No

@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 6, 2025
arrow-buffer = { workspace = true }
arrow-data = { workspace = true }
arrow-schema = { workspace = true }
arrow-cast = { workspace = true}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We ideally want to avoid doing this, as it adds a significant additional dependency

@tustvold
Copy link
Contributor

tustvold commented Aug 6, 2025

I think the issue is that Datafusion is not handling the fact that row encoding "hydrates" dictionaries. It should be updated to understand that List<Dictionary<...>> will be converted to List<...>, much like it already handles this for the non-nested case. Converting back to a dictionary is expensive, and likely pointless, not to mention a breaking change.

@tustvold tustvold added the api-change Changes to the arrow API label Aug 6, 2025
@tustvold
Copy link
Contributor

tustvold commented Aug 6, 2025

See #7165

@LiaCastaneda
Copy link
Author

Thanks @tustvold , then I understand the solution on @ding-young 's PR #7627 would be what we want? since I understand it keeps the flattened List

@LiaCastaneda
Copy link
Author

LiaCastaneda commented Aug 8, 2025

closing this one in favour of ding-young's PR

alamb added a commit that referenced this pull request Aug 21, 2025
# Which issue does this PR close?

- related to #7627
- Related to #4811

# Rationale for this change

It was not clear to me what the expected behavior for round trip through
row converter was for DictionaryArrays, so let's document what @tustvold
says here:
#8067 (comment)

> I think the issue is that Datafusion is not handling the fact that row
encoding "hydrates" dictionaries. It should be updated to understand
that List<Dictionary<...>> will be converted to List<...>, much like it
already handles this for the non-nested case. Converting back to a
dictionary is expensive, and likely pointless, not to mention a breaking
change.


# What changes are included in this PR?

Document expected behavior with english comments and doc test

# Are these changes tested?

Yes (doctests)

# Are there any user-facing changes?

More docs, no behavior change
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-change Changes to the arrow API arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Query grouping by column with datatype List<Dictionary<(),()>> is failing

2 participants