ARROW-16469: [Python] Table.filter accepts a boolean expression in addition to boolean array#13155
amol- wants to merge 4 commits into apache:master from amol-:ARROW-16469
  null_selection_behavior
-     How nulls in the mask should be handled.
+     How nulls in the mask should be handled, does nothing if
+     an :class:`.Expression` is used.
Is it not possible to pass this through to the filter node?
Not in any way that I can see. The filter node has a pretty straightforward constructor, `explicit FilterNodeOptions(Expression filter_expression, bool async_mode = true)`, which only accepts an expression.
I think that if you care about special handling of nulls, you probably want to build an expression that evaluates as you wish for nulls.
> I think that if you care about special handling of nulls, you probably want to build an expression that evaluates as you wish for nulls

I don't think it is possible to get the "emit null" behaviour by changing the expression. For dropping or keeping, you can explicitly fill the null with False or True, but preserving the row as null is only possible through this option. I suppose that is a good reason why this is an option of the filter kernel and not of, e.g., the comparison kernels.
Anyway, this is not that important given that the "drop" behaviour is the default for both (and is the typical behaviour you want, I think), but this might be something to open a JIRA for to add FilterOptions to the FilterNodeOptions (cc @westonpace would that make sense?)
Uhm, not sure I follow. Why can't you use an expression?
Given
>>> t = pa.table({"rows": [1, 2, 3, None, 5, 6]})
>>> t
pyarrow.Table
rows: int64
----
rows: [[1,2,3,null,5,6]]
If I want to drop the nulls, I do
>>> t.filter(pc.field("rows") < 5)
pyarrow.Table
rows: int64
----
rows: [[1,2,3]]
If instead I want to keep the nulls, I do
>>> t.filter((pc.field("rows") < 5) | (pc.field("rows").is_null()))
pyarrow.Table
rows: int64
----
rows: [[1,2,3,null]]
Regarding the "nulls" in the selection mask itself, I don't think FilterNode supports anything different from a boolean Expression, so the option doesn't make much sense in that context.
The option is about introducing nulls in the output data where the mask is null, not about preserving nulls from the input data. So for preserving nulls in the input, you can change your expression. But for introducing nulls, I don't think that is possible.
Using your example table:
In [29]: t.filter(pa.array([True, None, True, False, False, False]))
Out[29]:
pyarrow.Table
rows: int64
----
rows: [[1,3]]
vs
In [33]: t.filter(pa.array([True, None, True, False, False, False]), null_selection_behavior="emit_null")
Out[33]:
pyarrow.Table
rows: int64
----
rows: [[1,null,3]]
The null is in a place where the original data had a "2"
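The difference between the two behaviours can be modelled in plain Python. This is only a reference sketch of the semantics shown in the outputs above, not how the Arrow filter kernel is implemented; `filter_with_mask` is a hypothetical helper name.

```python
from typing import List, Optional, Sequence


def filter_with_mask(
    values: Sequence[Optional[int]],
    mask: Sequence[Optional[bool]],
    null_selection_behavior: str = "drop",
) -> List[Optional[int]]:
    """Plain-Python model of filtering with a nullable boolean mask."""
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    if null_selection_behavior not in ("drop", "emit_null"):
        raise ValueError("unknown null_selection_behavior")

    out: List[Optional[int]] = []
    for value, keep in zip(values, mask):
        if keep is None:
            # Null mask entry: either skip the row or emit a null in its place.
            if null_selection_behavior == "emit_null":
                out.append(None)
        elif keep:
            # True mask entry: the value passes through unchanged.
            out.append(value)
        # False mask entry: the row is dropped.
    return out


rows = [1, 2, 3, None, 5, 6]
mask = [True, None, True, False, False, False]

print(filter_with_mask(rows, mask))               # [1, 3]
print(filter_with_mask(rows, mask, "emit_null"))  # [1, None, 3]
```

The emitted `None` replaces the input value 2, so it is introduced by the mask rather than carried over from the data.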
Benchmark runs are scheduled for baseline = 1483b82 and contender = 71737ea. 71737ea is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Expose a `Dataset.filter` method that applies a filter to the dataset without actually loading it in memory. Addresses what was discussed in #13155 (comment)

- [x] Update documentation
- [x] Ensure the filtered dataset preserves the filter when writing it back
- [x] Ensure the filtered dataset preserves the filter when joining
- [x] Ensure the filtered dataset preserves the filter when applying standard `Dataset.something` methods.
- [x] Allow to extend the filter by adding more conditions subsequently `dataset(filter=X).filter(filter=Y).scanner(filter=Z)` (related to #13409 (comment))
- [x] Refactor to use only `Dataset` class instead of `FilteredDataset` as discussed with @jorisvandenbossche
- [x] Add support in replace_schema
- [x] Error in get_fragments in case a filter is set.
- [x] Verify support in UnionDataset

Lead-authored-by: Alessandro Molina
Co-authored-by: Joris Van den Bossche
Co-authored-by: Weston Pace
Co-authored-by: Antoine Pitrou
Signed-off-by: Alessandro Molina