ARROW-16616: [Python] Add lazy Dataset.filter() method#13409
ARROW-16616: [Python] Add lazy Dataset.filter() method#13409amol- merged 31 commits intoapache:masterfrom amol-:ARROW-16616
Conversation
|
|
|
I have no idea whether we want to expose such lazy construction APIs on Dataset. |
FYI, this task has spawned from #13155 (comment) discussion |
|
@pitrou Addressed all comments, should be ready for re-review |
westonpace
left a comment
There was a problem hiding this comment.
I'm not completely against this but having FilteredDataset instead of something like Query might be a bit short-sighted. What happens if a user wants to add a dynamic column (project)?
If you had both a projection expression and a filter expression that might be more close to what scanner / datasets provides today.
|
I am personally also a bit wary of adding a new public class like |
|
|
@westonpace I'm not particularly attached to the I also dislike the idea of reusing the |
|
From #13409 (comment)
Could you expand on this a bit? (I don't know where or why it was asked to move to a dedicated class, the only reference I find in the other PR is the question if this shouldn't live on the Scanner) It seems to me that if we want to expose a helper |
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
jorisvandenbossche
left a comment
There was a problem hiding this comment.
Thanks for the update!
There are a few more complexities:
There are some other Dataset methods that might also need to take into account this filter: replace_schema, get_fragments.
And what with a UnionDataset unioning filtered datasets ?
We should probably also note somewhere that Fragments don't inherit this filter.
The ParquetFileFragment has some methods that work with filters (split_by_row_group, subset).
python/pyarrow/includes/libarrow.pxd
Outdated
|
|
||
| cdef cppclass CSourceNodeOptions "arrow::compute::SourceNodeOptions"(CExecNodeOptions): | ||
| @staticmethod | ||
| CResult[shared_ptr[CSourceNodeOptions]] FromRecordBatchReader( |
There was a problem hiding this comment.
Is this still used in the current version of the PR?
There was a problem hiding this comment.
No anymore, but it still had a value on its own, so I didn't remove it. If it's a concern I can easily get rid of it. (Even though I would leave the C++ implementation around).
Co-authored-by: Joris Van den Bossche <[email protected]>
|
@jorisvandenbossche I checked for If the dataset is created using It seems that Asking because if the plan is to deprecate it some day, then it probably doesn't make much sense to invest the effort to work toward feature parity with |
|
I think you certainly don't have to care about ParquetDataset in this PR (we generally didn't add any of the additional methods from pyarrow.dataset.Dataset to it, so this PR just follows that). And let's have the discussion about what to do with ParquetDataset in a dedicated place (eg we already have https://issues.apache.org/jira/browse/ARROW-9720, although the description there is a bit outdated) |
Expose a
Dataset.filtermethod that applies a filter to the dataset without actually loading it in memory.Addresses what was discussed in #13155 (comment)
Dataset.somethingmethods.dataset(filter=X).filter(filter=Y).scanner(filter=Z)(related to ARROW-16616: [Python] Add lazy Dataset.filter() method #13409 (comment))Datasetclass instead ofFilteredDatasetas discussed with @jorisvandenbossche