ARROW-10008: [C++][Dataset] Fix filtering/row group statistics of dict columns #8311
bkietz wants to merge 2 commits into apache:master
Conversation
    }

    DCHECK(lhs.is_array());
    if (lhs.type()->id() == Type::DICTIONARY && rhs.type()->id() == Type::DICTIONARY) {
@wesm What do you think about adding kernels to scalar_compare.cc which do this inside compute::?
Yes, this sounds fine; can you open a JIRA issue about it?
jorisvandenbossche left a comment
For me the non-performant way of decoding is fine for now (certainly because the array+scalar case will be more common).
But should there be some more tests added?
We could also add the small reproducer from the issue (my comment) as a Python test.
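The decode-then-compare workaround discussed above can be sketched in plain Python (this is an illustration, not Arrow's actual API): a dictionary-encoded column stores integer indices into a dictionary of unique values, so two such columns can be compared by first materializing ("decoding") their dense values. The helper names are invented for this sketch.

```python
def decode(indices, dictionary):
    """Materialize a dictionary-encoded column into its dense values."""
    return [dictionary[i] for i in indices]

def compare_equal(lhs, rhs):
    """Element-wise equality of two decoded columns."""
    return [a == b for a, b in zip(lhs, rhs)]

# Two dictionary-encoded string columns; decoding first makes the
# comparison independent of how each dictionary is laid out.
lhs = decode([0, 1, 1, 2], ["a", "b", "c"])   # -> ["a", "b", "b", "c"]
rhs = decode([0, 1, 0, 0], ["a", "b", "c"])   # -> ["a", "b", "a", "a"]
print(compare_equal(lhs, rhs))                # [True, True, False, False]
```

This is the "non-performant" path: it allocates the full dense column instead of comparing indices directly, which is why dedicated dictionary-aware kernels in compute:: would be preferable long-term.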
    }

    auto maybe_min = min->CastTo(field->type());
    auto maybe_max = max->CastTo(field->type());
Does this change behaviour? For a dictionary with string values, is field->type() string or dictionary?
StatisticsAsScalars returns scalars whose types are the physical type, so even if the column is dictionary(string), min and max would be plain string before this cast.
(i.e., it only changes behavior in cases where the physical type wasn't appropriate)
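To illustrate why these min/max scalars matter, here is a plain-Python sketch (invented helper names, not Arrow's API) of how row-group statistics drive filtering: a row group can be skipped when its [min, max] range proves the predicate can never match, which is exactly the logic that breaks if the scalars carry the wrong type.

```python
def can_skip_row_group(stats_min, stats_max, op, value):
    """True if `column <op> value` is unsatisfiable within [stats_min, stats_max]."""
    if op == "==":
        return value < stats_min or value > stats_max
    if op == "<":
        return stats_min >= value   # every value in the group is already >= value
    if op == ">":
        return stats_max <= value   # every value in the group is already <= value
    return False  # unknown operator: never skip

# A row group whose string column spans ["apple", "pear"]:
print(can_skip_row_group("apple", "pear", "==", "zebra"))  # True: out of range
print(can_skip_row_group("apple", "pear", "==", "mango"))  # False: may match
```

Casting min/max to field->type() ensures these comparisons are made between values of a consistent type rather than between, say, a dictionary-typed filter value and a plain-string statistic.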
Parquet row group statistics did not respect dictionary encoding. Also added a workaround to support filtering a dictionary-encoded column.