Add drop=True option for where on Dataset and DataArray#815
Add drop=True option for where on Dataset and DataArray#815shoyer merged 2 commits intopydata:masterfrom
Conversation
|
@shoyer, I haven't added any specific documentation to the user guide. Would you recommend doing so and if so, should this go near here https://github.com/pydata/xarray/blame/master/doc/indexing.rst#L265? |
xarray/core/common.py
Outdated
| (mask, Dataset, DataArray)) | ||
|
|
||
| # clip the data corresponding to coordinate dims that are not used | ||
| clip = dict(zip(clipmask.dims,[np.unique(adim) for adim in np.where(clipmask)])) |
There was a problem hiding this comment.
np.where with one argument is just an alias np.nonzero, so I would opt for using np.nonzero directly instead.
Also, it's probably a good idea to use clipmask.values, just in case np.nonzero ever starts doing something funny on xarray.DataArray objects. In practice, this should work the same but be slightly more explicit.
There was a problem hiding this comment.
I just tried it and np.nonzero(clipmask) returned an error or
E ValueError: dimensions ('x',) must have the same length as the number of data dimensions, ndim=2
but we are all clear with np.nonzero(clipmask.values). Is this a bug?
There was a problem hiding this comment.
I'm not sure if that's a numpy bug or not but it's exactly the sort of thing I was worried about :).
Let's use np.nonzero(clipmask.values) or clipmask.values.nonzero()
There was a problem hiding this comment.
I went with the first.
|
Yes, expanding the section on "where" seems like the right place to put On Fri, Apr 1, 2016 at 11:08 AM, Phillip Wolfram [email protected]
|
|
This is very cool! Would it also work in that case? a = xr.DataArray(np.arange(5**3).reshape(5, 5, 5), dims=('x', 'y', 'z'))
mask = xr.DataArray(np.arange(5**2).reshape(5, 5), dims=('x', 'y'))
a.sel_where(np.logical_and(mask > 6, mask < 18)) |
xarray/core/common.py
Outdated
| clipmask = mask.to_array().any('variable') | ||
| elif isinstance(mask, DataArray): | ||
| clipmask = mask | ||
| elif isinstance(mask, bool): |
There was a problem hiding this comment.
I'm not sure this special case makes sense
There was a problem hiding this comment.
I wasn't sure either but included it for completeness. I'm removing it.
|
@fmaussion, for array = DataArray(np.arange(5**3).reshape(5, 5, 5), dims=('x', 'y', 'z'))
mask = DataArray(np.arange(5**2).reshape(5, 5), dims=('x', 'y'))
actual = array.sel_where(np.logical_and(mask > 6, mask < 18))it returns this (Pdb) actual
<xarray.DataArray (x: 3, y: 5, z: 5)>
array([[[ nan, nan, nan, nan, nan],
[ nan, nan, nan, nan, nan],
[ 35., 36., 37., 38., 39.],
[ 40., 41., 42., 43., 44.],
[ 45., 46., 47., 48., 49.]],
[[ 50., 51., 52., 53., 54.],
[ 55., 56., 57., 58., 59.],
[ 60., 61., 62., 63., 64.],
[ 65., 66., 67., 68., 69.],
[ 70., 71., 72., 73., 74.]],
[[ 75., 76., 77., 78., 79.],
[ 80., 81., 82., 83., 84.],
[ 85., 86., 87., 88., 89.],
[ nan, nan, nan, nan, nan],
[ nan, nan, nan, nan, nan]]])
Coordinates:
* x (x) int64 1 2 3
* y (y) int64 0 1 2 3 4
* z (z) int64 0 1 2 3 4where (Pdb) array
<xarray.DataArray (x: 5, y: 5, z: 5)>
array([[[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[ 10, 11, 12, 13, 14],
[ 15, 16, 17, 18, 19],
[ 20, 21, 22, 23, 24]],
[[ 25, 26, 27, 28, 29],
[ 30, 31, 32, 33, 34],
[ 35, 36, 37, 38, 39],
[ 40, 41, 42, 43, 44],
[ 45, 46, 47, 48, 49]],
[[ 50, 51, 52, 53, 54],
[ 55, 56, 57, 58, 59],
[ 60, 61, 62, 63, 64],
[ 65, 66, 67, 68, 69],
[ 70, 71, 72, 73, 74]],
[[ 75, 76, 77, 78, 79],
[ 80, 81, 82, 83, 84],
[ 85, 86, 87, 88, 89],
[ 90, 91, 92, 93, 94],
[ 95, 96, 97, 98, 99]],
[[100, 101, 102, 103, 104],
[105, 106, 107, 108, 109],
[110, 111, 112, 113, 114],
[115, 116, 117, 118, 119],
[120, 121, 122, 123, 124]]])
Coordinates:
* x (x) int64 0 1 2 3 4
* y (y) int64 0 1 2 3 4
* z (z) int64 0 1 2 3 4(Pdb) np.logical_and(mask > 6, mask < 18)
<xarray.DataArray (x: 5, y: 5)>
array([[False, False, False, False, False],
[False, False, True, True, True],
[ True, True, True, True, True],
[ True, True, True, False, False],
[False, False, False, False, False]], dtype=bool)
Coordinates:
* x (x) int64 0 1 2 3 4
* y (y) int64 0 1 2 3 4Is this what you were expecting? |
|
At some point it might be useful to be able to use dask to compute the mask (with |
|
@shoyer, thanks for submitting the dask issue. |
|
Forgive me if this is naive - is this equivalent to a pandas slice? |
|
@pwolfram thanks. It's hard to tell form the print, but from the dimensions of the output it looks fine. The real use case would if you have (x,y,time), you would like to sel_where based on x,y only (for example a landmask or something). |
|
@MaximilianR In this example I used If I used @fmaussion that's correct that you cannot do lazy computation with the result of |
|
@shoyer, squashed the commits for simplicity. |
| def test_sel_where(self): | ||
|
|
||
| # 1d | ||
| # data array case |
There was a problem hiding this comment.
can we add a test case with more interesting coordinates? i.e., something that would broken your previous code where you used sel mistakenly instead of isel.
|
Right, I see. I think it's equivalent only to a pandas slice on In [71]: series = pd.Series(range(10))
In [72]: series[series>5]
Out[72]:
6 6
7 7
8 8
9 9
dtype: int64
# ...same as sel_where
In [78]: df =pd.DataFrame(pd.np.arange(40).reshape(4,10))
In [79]: df[(df>11) & (df < 28)]
Out[79]:
0 1 2 3 4 5 6 7 8 9
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN 12 13 14 15 16 17 18 19
2 20 21 22 23 24 25 26 27 NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
# ...same as where |
|
FWIW I'm not a fan of the |
|
@shoyer et al, I think @MaximilianR's suggestion makes sense. This would reduce the amount of code that has to be accounted for too and would be cleaner. Essentially have this method be called via All feedback on this proposal welcome. |
|
+1 for drop=True On Fri, Apr 1, 2016 at 1:59 PM, Phillip Wolfram [email protected]
|
|
@shoyer, @MaximilianR, and @fmaussion this is done and I added a few more test cases with nonuniform coordinates to hopefully ensure there aren't any hidden bugs. |
doc/indexing.rst
Outdated
|
|
||
| By default ``where`` maintains the original size of the data. For cases | ||
| where the selected data size is much smaller than the original data, | ||
| use of the option ``drop='True'`` clips coordinate |
There was a problem hiding this comment.
should be just drop=True (not quotes)
xarray/core/common.py
Outdated
| Parameters | ||
| ---------- | ||
| cond : boolean DataArray or Dataset | ||
| drop : boolean specifying whether coordinate elements composed of only |
There was a problem hiding this comment.
I think it's more accurate to say "coordinate labels that only correspond to NA values should be dropped" -- the coordinates are not NaN, but rather the data.
also, this is not quite formatted correctly (the type should be on its own):
drop : boolean, optional
Some text description
|
looks good to me other than a few nits Please also add a release note crediting yourself! |
|
@shoyer, I believe I have addressed your comments and this should be ready to merge. Thanks! |
71581c2 to
27b49c0
Compare
| * y (y) int64 0 1 2 3 4 | ||
| """ | ||
| return self._where(cond) | ||
| if drop: |
There was a problem hiding this comment.
Given pandas & numpy's where function takes another arg after cond (other), I think we should check for drop's type to ensure anyone who passes that gets caught early
There was a problem hiding this comment.
Indeed this is an arg that could be kwarg-only in python3. (but we can't do that here)
There was a problem hiding this comment.
Good catch. Maybe we should add an other=None argument as a place holder for now, simply raising NotImplementedError if other is not None. I'm pretty sure we'll want to add that argument eventually, anyways.
There was a problem hiding this comment.
@shoyer and @MaximilianR, thanks for catching this potential issue. Please see the code I just pushed to address this issue. I believe this should be ready to be merged now. I've intentionally left the other option as not implemented for the documentation too, e.g., https://github.com/pydata/xarray/pull/815/files#diff-aec89f8189374cf98061efad7425f561R464
xarray/core/common.py
Outdated
| return result | ||
|
|
||
| def where(self, cond): | ||
| def where(self, cond, drop=False, other=None): |
There was a problem hiding this comment.
other should come before drop
By default ``where`` maintains the original size of the data. For cases where the selected data size is much smaller than the original data, use of the option ``drop=True`` clips coordinate elements that are fully masked.
|
Thanks @MaximilianR and @shoyer! I think we are at the last call before the merge now. |
|
|
||
| if drop: | ||
| from .dataarray import DataArray | ||
| from .dataset import Dataset |
There was a problem hiding this comment.
Just a nit: I think it's a little cleaner to put imports at the method/function definition.
|
Thanks @pwolfram ! |
|
My pleasure, thank you @shoyer for all the really great help designing the code and making it much better! |

Addresses #811 to provide a Dataset and DataArray
sel_wherewhich returns a Dataset or DataArray of minimal coordinate size.