ARROW-10008: [C++][Dataset] Fix filtering/row group statistics of dict columns #8311
bkietz wants to merge 2 commits into apache:master
Conversation
    }

    DCHECK(lhs.is_array());
    if (lhs.type()->id() == Type::DICTIONARY && rhs.type()->id() == Type::DICTIONARY) {
@wesm What do you think about adding kernels to scalar_compare.cc which do this inside compute::?
Yes, this sounds fine; can you open a JIRA issue about it?
jorisvandenbossche left a comment
For me the non-performant way of decoding is fine for now (certainly because the array+scalar case will be more common).
But should there be some more tests added?
We could also add the small reproducer from the issue (my comment) as a Python test.
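The decode-then-compare workaround discussed above can be sketched in plain Python (this is an illustration, not Arrow's actual API): a dictionary-encoded column stores integer indices into a dictionary of unique values, so two such columns can be compared by first materializing ("decoding") their dense values. The helper names are invented for this sketch.

```python
def decode(indices, dictionary):
    """Materialize a dictionary-encoded column into its dense values."""
    return [dictionary[i] for i in indices]

def compare_equal(lhs, rhs):
    """Element-wise equality of two decoded columns."""
    return [a == b for a, b in zip(lhs, rhs)]

# Two dictionary-encoded string columns; decoding first makes the
# comparison independent of how each dictionary is laid out.
lhs = decode([0, 1, 1, 2], ["a", "b", "c"])   # -> ["a", "b", "b", "c"]
rhs = decode([0, 1, 0, 0], ["a", "b", "c"])   # -> ["a", "b", "a", "a"]
print(compare_equal(lhs, rhs))                # [True, True, False, False]
```

This is the "non-performant" path: it allocates the full dense column instead of comparing indices directly, which is why dedicated dictionary-aware kernels in compute:: would be preferable long-term.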
    }

    auto maybe_min = min->CastTo(field->type());
    auto maybe_max = max->CastTo(field->type());
Does this change behaviour? For a dictionary with string values, is field->type() string or dictionary?
StatisticsAsScalars returns scalars whose types are the physical type, so even if the column is dictionary(string), min and max would be plain string before this cast.
(i.e., it only changes behavior in cases where the physical type wasn't appropriate)
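To illustrate why these min/max scalars matter, here is a plain-Python sketch (invented helper names, not Arrow's API) of how row-group statistics drive filtering: a row group can be skipped when its [min, max] range proves the predicate can never match, which is exactly the logic that breaks if the scalars carry the wrong type.

```python
def can_skip_row_group(stats_min, stats_max, op, value):
    """True if `column <op> value` is unsatisfiable within [stats_min, stats_max]."""
    if op == "==":
        return value < stats_min or value > stats_max
    if op == "<":
        return stats_min >= value   # every value in the group is already >= value
    if op == ">":
        return stats_max <= value   # every value in the group is already <= value
    return False  # unknown operator: never skip

# A row group whose string column spans ["apple", "pear"]:
print(can_skip_row_group("apple", "pear", "==", "zebra"))  # True: out of range
print(can_skip_row_group("apple", "pear", "==", "mango"))  # False: may match
```

Casting min/max to field->type() ensures these comparisons are made between values of a consistent type rather than between, say, a dictionary-typed filter value and a plain-string statistic.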
Parquet row group statistics did not respect dictionary encoding. Also added a workaround to support filtering a dictionary-encoded column.