ARROW-16616: [Python] Add lazy Dataset.filter() method by amol- · Pull Request #13409 · apache/arrow

amol- · 2022-06-21T16:54:28Z

Expose a Dataset.filter method that applies a filter to the dataset without actually loading it in memory.

Addresses what was discussed in #13155 (comment)

Update documentation
Ensure the filtered dataset preserves the filter when writing it back
Ensure the filtered dataset preserves the filter when joining
Ensure the filtered dataset preserves the filter when applying standard Dataset.something methods.
Allow extending the filter by chaining further conditions, e.g. dataset(filter=X).filter(filter=Y).scanner(filter=Z) (related to #13409 (comment))
Refactor to use only Dataset class instead of FilteredDataset as discussed with @jorisvandenbossche
Add support in replace_schema
Raise an error in get_fragments when a filter is set.
Verify support in UnionDataset

github-actions · 2022-06-21T17:02:02Z

https://issues.apache.org/jira/browse/ARROW-16616

github-actions · 2022-06-21T17:02:04Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

pitrou · 2022-06-28T15:00:50Z

I have no idea whether we want to expose such lazy construction APIs on Dataset.

cc @jorisvandenbossche

amol- · 2022-06-28T15:49:10Z

> I have no idea whether we want to expose such lazy construction APIs on Dataset.

FYI, this task has spawned from #13155 (comment) discussion

amol- · 2022-07-05T17:28:30Z

@pitrou Addressed all comments, should be ready for re-review

westonpace

I'm not completely against this but having FilteredDataset instead of something like Query might be a bit short-sighted. What happens if a user wants to add a dynamic column (project)?

If you had both a projection expression and a filter expression, that might be closer to what scanner / datasets provide today.

jorisvandenbossche · 2022-07-07T15:25:44Z

I am personally also a bit wary of adding a new public class like FilteredDataset (at least until we have had the broader discussion about how we want to provide a dataframe-like / query object interface, as similar discussions will keep coming up for other methods).
If we want to provide this filter() method in the short term, I would also prefer doing it just on Dataset, as Weston suggested (that was also my original idea for this issue). That creates its own backwards compatibility issues, of course: if we later let this method return an object backed by a query plan, that object might not keep all the methods/attributes currently available on Dataset.

jorisvandenbossche · 2022-07-07T15:32:26Z

get_fragments is also another method where the user could expect this filter to be applied (giving the same result as specifying the filter keyword)

amol- · 2022-07-11T10:45:02Z

> I'm not completely against this but having FilteredDataset instead of something like Query might be a bit short-sighted. What happens if a user wants to add a dynamic column (project)?
>
> If you had both a projection expression and a filter expression, that might be closer to what scanner / datasets provide today.

@westonpace I'm not particularly attached to the FilteredDataset name. I just want to avoid using the Query name explicitly, so that we don't hint to users that this is meant to become a fully fledged query system, at least for now. They can use IBIS if they are looking for that.

I also dislike the idea of reusing the Scanner class as it smells like hijacking its data read responsibility. I wanted a name that conveys correctly the idea of something that represents a dataset with an applied transformation and to which additional transformations can be applied. Maybe QueriedDataset could do? Smells a lot like the query already happened, so not exactly what I was looking for. I'm open to suggestions. Naming things correctly seems hard, I could try invalidating some caches instead ;)

jorisvandenbossche · 2022-10-26T10:13:13Z

From #13409 (comment)

> If self._filter can be None then what is the advantage of creating a separate FilteredDataset instead of just adding _filter to the existing Dataset?

> That was the original implementation, and I was asked to explicitly move it in a dedicated class. Which I think in the end makes sense, better have a single responsibility per class.

Could you expand on this a bit? (I don't know where or why it was asked to move to a dedicated class, the only reference I find in the other PR is the question if this shouldn't live on the Scanner)

It seems to me that if we want to expose a helper filter() method (although it doesn't add that much value compared to passing the filter to the method that will actually do the scanning, i.e. to_table(..), to_batches(..), etc.), adding it just to the main Dataset class will expose the least amount of new API that we "lock in" (it avoids deciding now whether we want some "Query"-like class).
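
To make the "just add _filter to the existing Dataset" option concrete, here is a small pure-Python sketch. It is not pyarrow's implementation: the class SketchDataset, its to_rows method, and the use of plain callables as predicates are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Predicate = Callable[[dict], bool]


@dataclass(frozen=True)
class SketchDataset:
    """Hypothetical stand-in for a dataset that may carry a pending filter."""
    rows: tuple
    _filter: Optional[Predicate] = None  # None means "no filter recorded"

    def filter(self, predicate: Predicate) -> "SketchDataset":
        # Record the condition (AND-ed with any earlier one) without
        # touching the rows; the original object is left unchanged.
        current = self._filter
        if current is None:
            combined = predicate
        else:
            combined = lambda row: current(row) and predicate(row)
        return SketchDataset(self.rows, combined)

    def to_rows(self, filter: Optional[Predicate] = None) -> list:
        # The scan is the only place where the filters are evaluated;
        # a filter passed here is combined with the recorded one.
        preds = [p for p in (self._filter, filter) if p is not None]
        return [r for r in self.rows if all(p(r) for p in preds)]


base = SketchDataset(rows=({"a": 1}, {"a": 2}, {"a": 3}, {"a": 4}))
lazy = base.filter(lambda r: r["a"] > 1).filter(lambda r: r["a"] < 4)

print(lazy.to_rows())                               # [{'a': 2}, {'a': 3}]
print(lazy.to_rows(filter=lambda r: r["a"] != 2))   # [{'a': 3}]
print(len(base.to_rows()))                          # 4
```

With this shape no second public class is needed: an unfiltered dataset is simply one whose recorded filter is None.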

jorisvandenbossche

Thanks for the update!

There are a few more complexities:

There are some other Dataset methods that might also need to take into account this filter: replace_schema, get_fragments.
And what about a UnionDataset that unions filtered datasets?

We should probably also note somewhere that Fragments don't inherit this filter.
The ParquetFileFragment has some methods that work with filters (split_by_row_group, subset).

jorisvandenbossche · 2022-11-29T09:27:13Z

python/pyarrow/includes/libarrow.pxd


+    cdef cppclass CSourceNodeOptions "arrow::compute::SourceNodeOptions"(CExecNodeOptions):
+        @staticmethod
+        CResult[shared_ptr[CSourceNodeOptions]] FromRecordBatchReader(


Is this still used in the current version of the PR?

amol- · Nov 29, 2022

Not anymore, but it still has value on its own, so I didn't remove it. If it's a concern I can easily get rid of it (even though I would leave the C++ implementation around).

amol- · 2022-12-07T13:55:35Z

@jorisvandenbossche I checked for ParquetDataset and the experience is fairly confusing from the end user point of view.

If the dataset is created using ds.parquet_dataset it will have the filter capabilities, but if it's created using pyarrow.parquet.ParquetDataset it won't have filtering capabilities. But ParquetDataset in its V2 implementation is just a proxy to ds.Dataset, so it could in theory gain filtering support.

It seems that ParquetDataset is mostly a duplicate of what ds.parquet_dataset can do when use_legacy_dataset=False, so is there a reason why we keep it around? Is there a plan to deprecate it in the future?

Asking because if the plan is to deprecate it some day, then it probably doesn't make much sense to invest the effort to work toward feature parity with ds.Dataset and we can consider this task done.

jorisvandenbossche · 2022-12-07T14:57:16Z

I think you certainly don't have to care about ParquetDataset in this PR (we generally didn't add any of the additional methods from pyarrow.dataset.Dataset to it, so this PR just follows that). And let's have the discussion about what to do with ParquetDataset in a dedicated place (eg we already have https://issues.apache.org/jira/browse/ARROW-9720, although the description there is a bit outdated)

amol- requested a review from pitrou June 21, 2022 16:55

github-actions bot added the Component: Python label Jun 21, 2022

github-actions bot added Component: C++ Component: Documentation labels Jun 30, 2022

amol- marked this pull request as ready for review July 5, 2022 09:59

rok reviewed Jul 5, 2022

pitrou reviewed Jul 5, 2022

westonpace reviewed Jul 5, 2022

amol- closed this Jul 6, 2022

amol- reopened this Jul 6, 2022

amol- and others added 11 commits November 24, 2022 13:54

Proof of concept

259f7f2

Working joins

e3981e5

lint

4f50a8b

Ensure standard dataset methods keep the filter

545067a

Document Dataset.filter

4441003

Lint

1d33ee3

Add class to reference

9d3b4a9

tweak variable name

e51ac73

Update docs/source/python/compute.rst

87a828f

Co-authored-by: Antoine Pitrou <[email protected]>

Update python/pyarrow/_dataset.pyx

c932b97

Co-authored-by: Antoine Pitrou <[email protected]>

Remove unecessary casts

30d2409

amol- and others added 9 commits November 24, 2022 13:57

better error and docstrings

6523ed5

Tweak docstrings

3476dfb

Update cpp/src/arrow/compute/exec/options.cc

569cd80

Co-authored-by: Weston Pace <[email protected]>

Refactoring

4401496

Allow to create ScanOptions alone

b134425

Working filter and join on filtered datasets

737dac2

lint

94e6713

Test with chained filtering

16afe33

Remove usage of FilteredDataset class

3f83890

amol- added this to the 11.0.0 milestone Nov 28, 2022

jorisvandenbossche reviewed Nov 29, 2022

amol- and others added 2 commits November 29, 2022 11:43

Update docs/source/python/compute.rst

e41fd94

Co-authored-by: Joris Van den Bossche <[email protected]>

Disable Dataset.get_fragments() when filtered

9c764ea

jorisvandenbossche reviewed Dec 2, 2022

amol- and others added 6 commits December 2, 2022 17:54

Disable passing None as a filter

ba387f4

2 lines separation between classes

74ba4be

Dataset.replace_schema

5e96eb7

Remove CSourceNodeOptions.FromRecordBatchReader we don't use it anymore

b1cf868

Deal with UnionDataset

75fe9fd

minor fixes

a6c08b1

jorisvandenbossche approved these changes Dec 6, 2022

fixup

c2ff428

jorisvandenbossche changed the title ~~ARROW-16616: [Python] Lazy datasets filtering~~ ARROW-16616: [Python] Add Dataset.filter() method Dec 6, 2022

jorisvandenbossche changed the title ~~ARROW-16616: [Python] Add Dataset.filter() method~~ ARROW-16616: [Python] Add lazy Dataset.filter() method Dec 6, 2022

Ensure filtering works in parquet datasets too

1f1d7bd

amol- merged commit f67009a into apache:master Dec 12, 2022

amol- deleted the ARROW-16616 branch December 12, 2022 15:40

asfimport mentioned this pull request Dec 12, 2022

[Python] Allow lazy evaluation of filters in Dataset and add Datset.filter method #31969

Closed


5 participants
