Skip to content

Add drop=True option for where on Dataset and DataArray#815

Merged
shoyer merged 2 commits intopydata:masterfrom
pwolfram:add_sel_where
Apr 10, 2016
Merged

Add drop=True option for where on Dataset and DataArray#815
shoyer merged 2 commits intopydata:masterfrom
pwolfram:add_sel_where

Conversation

@pwolfram
Copy link
Copy Markdown
Contributor

@pwolfram pwolfram commented Apr 1, 2016

Addresses #811 to provide a Dataset and DataArray sel_where which returns a Dataset or DataArray of minimal coordinate size.

@pwolfram
Copy link
Copy Markdown
Contributor Author

pwolfram commented Apr 1, 2016

@shoyer, I haven't added any specific documentation to the user guide. Would you recommend doing so and if so, should this go near here https://github.com/pydata/xarray/blame/master/doc/indexing.rst#L265?

(mask, Dataset, DataArray))

# clip the data corresponding to coordinate dims that are not used
clip = dict(zip(clipmask.dims,[np.unique(adim) for adim in np.where(clipmask)]))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.where with one argument is just an alias np.nonzero, so I would opt for using np.nonzero directly instead.

Also, it's probably a good idea to use clipmask.values, just in case np.nonzero ever starts doing something funny on xarray.DataArray objects. In practice, this should work the same but be slightly more explicit.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tried it and np.nonzero(clipmask) returned an error or

E           ValueError: dimensions ('x',) must have the same length as the number of data dimensions, ndim=2

but we are all clear with np.nonzero(clipmask.values). Is this a bug?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if that's a numpy bug or not but it's exactly the sort of thing I was worried about :).

Let's use np.nonzero(clipmask.values) or clipmask.values.nonzero()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with the first.

@shoyer
Copy link
Copy Markdown
Member

shoyer commented Apr 1, 2016

Yes, expanding the section on "where" seems like the right place to put
this.

On Fri, Apr 1, 2016 at 11:08 AM, Phillip Wolfram [email protected]
wrote:

@shoyer https://github.com/shoyer, I haven't added any specific
documentation. Would you recommend doing so and if so, should this go near
here https://github.com/pydata/xarray/blame/master/doc/indexing.rst#L265?


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#815 (comment)

@fmaussion
Copy link
Copy Markdown
Member

This is very cool! Would it also work in that case?

a = xr.DataArray(np.arange(5**3).reshape(5, 5, 5), dims=('x', 'y', 'z'))
mask = xr.DataArray(np.arange(5**2).reshape(5, 5), dims=('x', 'y'))
a.sel_where(np.logical_and(mask > 6, mask < 18))

clipmask = mask.to_array().any('variable')
elif isinstance(mask, DataArray):
clipmask = mask
elif isinstance(mask, bool):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this special case makes sense

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure either but included it for completeness. I'm removing it.

@pwolfram
Copy link
Copy Markdown
Contributor Author

pwolfram commented Apr 1, 2016

@fmaussion, for

array = DataArray(np.arange(5**3).reshape(5, 5, 5), dims=('x', 'y', 'z'))
mask = DataArray(np.arange(5**2).reshape(5, 5), dims=('x', 'y'))
actual = array.sel_where(np.logical_and(mask > 6, mask < 18))

it returns this

(Pdb) actual
<xarray.DataArray (x: 3, y: 5, z: 5)>
array([[[ nan,  nan,  nan,  nan,  nan],
        [ nan,  nan,  nan,  nan,  nan],
        [ 35.,  36.,  37.,  38.,  39.],
        [ 40.,  41.,  42.,  43.,  44.],
        [ 45.,  46.,  47.,  48.,  49.]],

       [[ 50.,  51.,  52.,  53.,  54.],
        [ 55.,  56.,  57.,  58.,  59.],
        [ 60.,  61.,  62.,  63.,  64.],
        [ 65.,  66.,  67.,  68.,  69.],
        [ 70.,  71.,  72.,  73.,  74.]],

       [[ 75.,  76.,  77.,  78.,  79.],
        [ 80.,  81.,  82.,  83.,  84.],
        [ 85.,  86.,  87.,  88.,  89.],
        [ nan,  nan,  nan,  nan,  nan],
        [ nan,  nan,  nan,  nan,  nan]]])
Coordinates:
  * x        (x) int64 1 2 3
  * y        (y) int64 0 1 2 3 4
  * z        (z) int64 0 1 2 3 4

where

(Pdb) array
<xarray.DataArray (x: 5, y: 5, z: 5)>
array([[[  0,   1,   2,   3,   4],
        [  5,   6,   7,   8,   9],
        [ 10,  11,  12,  13,  14],
        [ 15,  16,  17,  18,  19],
        [ 20,  21,  22,  23,  24]],

       [[ 25,  26,  27,  28,  29],
        [ 30,  31,  32,  33,  34],
        [ 35,  36,  37,  38,  39],
        [ 40,  41,  42,  43,  44],
        [ 45,  46,  47,  48,  49]],

       [[ 50,  51,  52,  53,  54],
        [ 55,  56,  57,  58,  59],
        [ 60,  61,  62,  63,  64],
        [ 65,  66,  67,  68,  69],
        [ 70,  71,  72,  73,  74]],

       [[ 75,  76,  77,  78,  79],
        [ 80,  81,  82,  83,  84],
        [ 85,  86,  87,  88,  89],
        [ 90,  91,  92,  93,  94],
        [ 95,  96,  97,  98,  99]],

       [[100, 101, 102, 103, 104],
        [105, 106, 107, 108, 109],
        [110, 111, 112, 113, 114],
        [115, 116, 117, 118, 119],
        [120, 121, 122, 123, 124]]])
Coordinates:
  * x        (x) int64 0 1 2 3 4
  * y        (y) int64 0 1 2 3 4
  * z        (z) int64 0 1 2 3 4
(Pdb) np.logical_and(mask > 6, mask < 18)
<xarray.DataArray (x: 5, y: 5)>
array([[False, False, False, False, False],
       [False, False,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True, False, False],
       [False, False, False, False, False]], dtype=bool)
Coordinates:
  * x        (x) int64 0 1 2 3 4
  * y        (y) int64 0 1 2 3 4

Is this what you were expecting?

@shoyer
Copy link
Copy Markdown
Member

shoyer commented Apr 1, 2016

At some point it might be useful to be able to use dask to compute the mask (with mask.data.nonzero()), so I made an issue for that: dask/dask#1076

@pwolfram
Copy link
Copy Markdown
Contributor Author

pwolfram commented Apr 1, 2016

@shoyer, thanks for submitting the dask issue.

@max-sixty
Copy link
Copy Markdown
Collaborator

Forgive me if this is naive - is this equivalent to a pandas slice?

@fmaussion
Copy link
Copy Markdown
Member

@pwolfram thanks. It's hard to tell form the print, but from the dimensions of the output it looks fine. The real use case would if you have (x,y,time), you would like to sel_where based on x,y only (for example a landmask or something).

@pwolfram
Copy link
Copy Markdown
Contributor Author

pwolfram commented Apr 1, 2016

@shoyer, user documentation added in 96677b5

@shoyer
Copy link
Copy Markdown
Member

shoyer commented Apr 1, 2016

@MaximilianR In this example I used where to produce a map:

>>> ds.states.where(ds.states == state_ids['California']).plot()

image

If I used sel_where, the plot would be automatically trimmed. That's the main usecase here. I don't think there is an equivalent shortcut in pandas.

@fmaussion that's correct that you cannot do lazy computation with the result of nonzero on a dask array. However, computing nonzero itself could be an expensive operation if the mask is large, and we could do that with dask.

@pwolfram
Copy link
Copy Markdown
Contributor Author

pwolfram commented Apr 1, 2016

@shoyer, squashed the commits for simplicity.

def test_sel_where(self):

# 1d
# data array case
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a test case with more interesting coordinates? i.e., something that would broken your previous code where you used sel mistakenly instead of isel.

@max-sixty
Copy link
Copy Markdown
Collaborator

Right, I see. I think it's equivalent only to a pandas slice on Series, rather than a DataFrame.

In [71]: series = pd.Series(range(10))

In [72]: series[series>5]
Out[72]: 
6    6
7    7
8    8
9    9
dtype: int64

# ...same as sel_where


In [78]: df =pd.DataFrame(pd.np.arange(40).reshape(4,10))

In [79]: df[(df>11) & (df < 28)]
Out[79]: 
    0   1   2   3   4   5   6   7   8   9
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN  12  13  14  15  16  17  18  19
2  20  21  22  23  24  25  26  27 NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

# ...same as where

@max-sixty
Copy link
Copy Markdown
Collaborator

FWIW I'm not a fan of the sel_where name, sel is otherwise associated with labels, and this takes bools.
Have you thought about including this in where, with a kwarg such as drop?

@pwolfram
Copy link
Copy Markdown
Contributor Author

pwolfram commented Apr 1, 2016

@shoyer et al, I think @MaximilianR's suggestion makes sense. This would reduce the amount of code that has to be accounted for too and would be cleaner. Essentially have this method be called via array.where(mask, drop=True) instead of array.sel_where(mask).

All feedback on this proposal welcome.

@shoyer
Copy link
Copy Markdown
Member

shoyer commented Apr 1, 2016

+1 for drop=True

On Fri, Apr 1, 2016 at 1:59 PM, Phillip Wolfram [email protected]
wrote:

@shoyer https://github.com/shoyer et al, I think @MaximilianR
https://github.com/MaximilianR's suggestion makes sense. This would
reduce the amount of code that has to be accounted for too and would be
cleaner. Essentially have this method be called via array.where(mask,
drop=True) instead of array.sel_where(mask).

All feedback on this proposal welcome.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#815 (comment)

@pwolfram
Copy link
Copy Markdown
Contributor Author

pwolfram commented Apr 1, 2016

@shoyer, @MaximilianR, and @fmaussion this is done and I added a few more test cases with nonuniform coordinates to hopefully ensure there aren't any hidden bugs.

@pwolfram pwolfram changed the title Add sel_where to Datset and DataArray Add where drop=True option for Datset and DataArray Apr 1, 2016
@pwolfram pwolfram changed the title Add where drop=True option for Datset and DataArray Add drop=True option for where on Datset and DataArray Apr 1, 2016
@shoyer shoyer changed the title Add drop=True option for where on Datset and DataArray Add drop=True option for where on Dataset and DataArray Apr 4, 2016
doc/indexing.rst Outdated

By default ``where`` maintains the original size of the data. For cases
where the selected data size is much smaller than the original data,
use of the option ``drop='True'`` clips coordinate
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be just drop=True (not quotes)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Parameters
----------
cond : boolean DataArray or Dataset
drop : boolean specifying whether coordinate elements composed of only
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's more accurate to say "coordinate labels that only correspond to NA values should be dropped" -- the coordinates are not NaN, but rather the data.

also, this is not quite formatted correctly (the type should be on its own):

drop : boolean, optional
    Some text description

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

@shoyer
Copy link
Copy Markdown
Member

shoyer commented Apr 4, 2016

looks good to me other than a few nits

Please also add a release note crediting yourself!

@pwolfram
Copy link
Copy Markdown
Contributor Author

pwolfram commented Apr 4, 2016

@shoyer, I believe I have addressed your comments and this should be ready to merge. Thanks!

@pwolfram pwolfram force-pushed the add_sel_where branch 3 times, most recently from 71581c2 to 27b49c0 Compare April 4, 2016 20:16
* y (y) int64 0 1 2 3 4
"""
return self._where(cond)
if drop:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given pandas & numpy's where function takes another arg after cond (other), I think we should check for drop's type to ensure anyone who passes that gets caught early

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed this is an arg that could be kwarg-only in python3. (but we can't do that here)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Maybe we should add an other=None argument as a place holder for now, simply raising NotImplementedError if other is not None. I'm pretty sure we'll want to add that argument eventually, anyways.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shoyer and @MaximilianR, thanks for catching this potential issue. Please see the code I just pushed to address this issue. I believe this should be ready to be merged now. I've intentionally left the other option as not implemented for the documentation too, e.g., https://github.com/pydata/xarray/pull/815/files#diff-aec89f8189374cf98061efad7425f561R464

return result

def where(self, cond):
def where(self, cond, drop=False, other=None):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other should come before drop

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks @MaximilianR

By default ``where`` maintains the original size of the data.  For cases
where the selected data size is much smaller than the original data, use
of the option ``drop=True`` clips coordinate elements that are fully
masked.
@pwolfram
Copy link
Copy Markdown
Contributor Author

pwolfram commented Apr 7, 2016

Thanks @MaximilianR and @shoyer! I think we are at the last call before the merge now.


if drop:
from .dataarray import DataArray
from .dataset import Dataset
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a nit: I think it's a little cleaner to put imports at the method/function definition.

@shoyer shoyer merged commit 28ee86a into pydata:master Apr 10, 2016
@shoyer
Copy link
Copy Markdown
Member

shoyer commented Apr 10, 2016

Thanks @pwolfram !

@pwolfram
Copy link
Copy Markdown
Contributor Author

My pleasure, thank you @shoyer for all the really great help designing the code and making it much better!

@pwolfram pwolfram deleted the add_sel_where branch September 20, 2016 14:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants