Skip to content

ARROW-10100: [C++][Python][Dataset] Add ParquetFileFragment::Subset method#8301

Closed
jorisvandenbossche wants to merge 6 commits intoapache:masterfrom
jorisvandenbossche:ARROW-10100-parquetfilefragment-subset
Closed

ARROW-10100: [C++][Python][Dataset] Add ParquetFileFragment::Subset method#8301
jorisvandenbossche wants to merge 6 commits intoapache:masterfrom
jorisvandenbossche:ARROW-10100-parquetfilefragment-subset

Conversation

@jorisvandenbossche
Copy link
Copy Markdown
Member

No description provided.

@jorisvandenbossche
Copy link
Copy Markdown
Member Author

@bkietz this is what I had in mind for https://issues.apache.org/jira/browse/ARROW-10100. Thoughts?

(only POC, if we add it, need to add proper tests etc)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I would still need to check that all row_group_ids exist in the fragment

@github-actions
Copy link
Copy Markdown

@jorisvandenbossche jorisvandenbossche force-pushed the ARROW-10100-parquetfilefragment-subset branch from 288a329 to a4faba0 Compare October 9, 2020 09:43
@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review October 9, 2020 09:44
@jorisvandenbossche
Copy link
Copy Markdown
Member Author

@bkietz so I was now thinking it is actually fine to return a fragment with 0 row groups (it can then be the responsibility of the user to check for this, if they want to filter out the empty fragments out of their list of fragments). But, the problem is that an empty row group vector currently already means "all row groups" ...
Would it be fine to change empty vector to nullptr to indicate "all row groups" ?

Copy link
Copy Markdown
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being more explicit about "lack of subselection" vs "empty subselection" sounds fine to me. I'll push a patch

@jorisvandenbossche
Copy link
Copy Markdown
Member Author

@bkietz Thanks for the update! I further expanded the tests somewhat for the empty case.

Copy link
Copy Markdown
Member

@kszucs kszucs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, build error is unrelated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants