Masked arrays #2301
Conversation
@shoyer @pelson @niallrobinson @rabernat @pwolfram this may interest you or others within your respective organizations.
dask/array/reductions.py
Outdated
```python
if dtype is not None:
    x = x.astype(dtype)
return x

return divide(x1, x2, dtype=dtype)
```
This function seems unfortunate. Is there a way to not bake in `ma` here? What happens if another `ma` implementation arises? (This was discussed in the `__array_ufunc__` PR.)
This function and the one above it (`empty_type_of`) could be easily switched out for a dispatch system, but the changes for the arg and cumulative operations less so. My personal preference is to wait until there's a need for another in-memory container, and then figure out what needs to be generalized.
I've generalized the easy things to generalize. I'd still rather wait until we have another in-memory container before figuring out what needs to be generalized for the arg and cumulative operations.
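For illustration, a minimal sketch of the kind of type-based dispatch registry being discussed. The `Dispatch` class and `empty_like_lookup` names below are illustrative, not dask's actual internals:

```python
import numpy as np

class Dispatch:
    """Minimal single-dispatch registry keyed on the argument's type
    (illustrative sketch; dask has its own Dispatch utility)."""
    def __init__(self):
        self._lookup = {}

    def register(self, typ, func=None):
        if func is None:
            return lambda f: self.register(typ, f)
        self._lookup[typ] = func
        return func

    def dispatch(self, typ):
        # Walk the MRO so subclasses fall back to parent implementations
        for cls in typ.__mro__:
            if cls in self._lookup:
                return self._lookup[cls]
        raise TypeError("No dispatch for %s" % typ)

    def __call__(self, arg, *args, **kwargs):
        return self.dispatch(type(arg))(arg, *args, **kwargs)

empty_like_lookup = Dispatch()

@empty_like_lookup.register(np.ndarray)
def _(x):
    return np.empty_like(x)

@empty_like_lookup.register(np.ma.masked_array)
def _(x):
    # Preserve the mask while leaving the data uninitialized
    return np.ma.masked_array(np.empty_like(np.ma.getdata(x)),
                              mask=np.ma.getmaskarray(x).copy())
```

Registering the masked-array case separately keeps the `ma` special-casing out of the core function, at the cost of one extra indirection per call.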
The special-casing of

Looks very cool! However, it will be of limited use to xarray, since xarray does not use masked arrays internally. (It uses NaN to represent masked elements.)
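As a small numpy-only illustration of the difference between the two representations (assuming simple float data):

```python
import numpy as np

masked = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])

# xarray-style: represent missing values as NaN in a plain float array
as_nan = masked.filled(np.nan)

# NaN-aware reductions recover the same answer as the masked reduction
assert np.nanmean(as_nan) == masked.mean()

# The trade-off: NaN requires a float dtype, so integer data must be cast
ints = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
as_nan_ints = ints.astype(float).filled(np.nan)
```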
This PR is falling out of sync with master. Before I take the time to fix the merge conflicts, is this of actual use to anyone? It would be good to hear from @shoyer, @pelson, @niallrobinson, @pwolfram, or @bjlittle on whether this would be useful for any of the work y'all do. FWIW I think the maintenance costs here are fairly minimal, but it'd be good to have a real world use case before actually merging.
How does Iris handle masked arrays today? Also @njsmith, do you have a sense for the planned longevity of the `numpy.ma` module?
@mrocklin At the moment, we're in the process of replacing
How would life change if dask.array supported masked arrays? Would this have near-term positive impact on Iris and the Met Office?
IMO we're committed to cutting the next release of

Naturally, from our next release, users of

So, for me, the near-term positive impact for
Wanted to add something here. From past experience there are many functions operating on NumPy's masked arrays that have bugs, strange behavior, or are incomplete somehow. Feel free to take a look at NumPy's issue tracker for examples; a relevant search is linked below. So taking on masked array support means handling these sorts of cases somehow, or at a bare minimum directing feedback upstream. This isn't an argument against it, just trying to make sure you are aware of these problems. Admittedly, people that are already using `numpy.ma` are probably aware of these shortcomings.

ref: https://github.com/numpy/numpy/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aopen%20label%3A%22component%3A%20numpy.ma%22%20
While I can't speak for Nathaniel, my understanding is Matplotlib makes use of Masked Arrays throughout. So NumPy couldn't drop Masked Arrays without breaking Matplotlib in a pretty big way, which seems like incentive enough to leave it alone. This in spite of occasional rumblings on the NumPy issue tracker to the contrary.
NumPy tries *very* hard to avoid breaking backwards compatibility. You can
count on masked arrays being around for the indefinite future, despite all
of their issues. We haven't even gotten rid of np.matrix, even though all
the NumPy developers agree that it shouldn't be used for new code.
Perhaps a more optimistic question is "is there likely to be a masked array alternative in the near future?"
I would very much like to see proper NA support in numpy. That definitely doesn't replace all use cases for
I'm in agreement with @shoyer and @njsmith. Our whole

I'm pretty much in awe of @jcrist's efforts to get this PR up 👍. Totally awesome 🍻

That said, this PR has forced the discussion on whether
So maybe we should do the following:
Thoughts?
The changes here were mostly:
Of these, I'm only really in favor of adding the first separately from masked-array support. The rest would clutter up the code a bit without a specific reason behind them. I'm fine with letting this idle (even closed) until masked arrays are requested - it shouldn't take long to bring back up to date.
Hello @mrocklin et al,

I'd like to share my perspective as an Iris developer. I think that the adoption of Dask within Iris is a really positive and exciting step for us. The work to integrate dask into Iris has been fairly intensive, and working around the lack of masked arrays has formed a significant quantity of this work.

I think that in the short term, we will be working with dask and patching around the lack of masked array support. However, I feel that this represents a degree of technical debt for our implementation. So, in the near future, I'm interested in dask supporting numpy's masked array implementation, as it feels like the widely used implementation, and many libraries using numpy rely on it for their core functionality.

I think that driving a conversation about how a future numpy and dask could handle the concept of missing data and processing in different ways is a really valuable conversation. This is a long-term activity, which would be informed by dask adopting numpy's masked array as is.

I'd like to offer my encouragement for dask to adopt

many thanks
Ok, I've brought this PR back in sync with master. I don't think this adds much complexity to

@marqh, @bjlittle do you have examples of what kind of operations you'd like to do with masked arrays? Do you have time/motivation to try this PR out?
pelson left a comment:
By no means a comprehensive review, but a few noteworthy things:
>>> da.from_array(np.ma.masked_array([1, 2, 3], mask=[1, 0, 0]), chunks=(2,))
Traceback (most recent call last):
File "/Users/pelson/miniconda/envs/dev-dask/lib/python3.6/site-packages/numpy/ma/core.py", line 3142, in view
if issubclass(dtype, ndarray):
TypeError: issubclass() arg 1 must be a class
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/pelson/dev/dask/dask/base.py", line 397, in normalize_array
data = hash_buffer_hex(x.ravel(order='K').view('i1'))
File "/Users/pelson/miniconda/envs/dev-dask/lib/python3.6/site-packages/numpy/ma/core.py", line 3148, in view
output = ndarray.view(self, dtype)
File "/Users/pelson/miniconda/envs/dev-dask/lib/python3.6/site-packages/numpy/ma/core.py", line 3425, in __setattr__
self._mask.shape = self.shape
ValueError: cannot reshape array of size 3 into shape (24,)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/pelson/miniconda/envs/dev-dask/lib/python3.6/site-packages/numpy/ma/core.py", line 3142, in view
if issubclass(dtype, ndarray):
TypeError: issubclass() arg 1 must be a class
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "example.py", line 10, in <module>
ad = da.from_array(np.ma.masked_array([1, 2, 3], mask=[1, 0, 0]), chunks=(2,))
File "/Users/pelson/dev/dask/dask/array/core.py", line 1893, in from_array
token = tokenize(x, chunks)
File "/Users/pelson/dev/dask/dask/base.py", line 426, in tokenize
return md5(str(tuple(map(normalize_token, args))).encode()).hexdigest()
File "/Users/pelson/dev/dask/dask/utils.py", line 415, in __call__
return meth(arg)
File "/Users/pelson/dev/dask/dask/base.py", line 399, in normalize_array
data = hash_buffer_hex(x.copy().ravel(order='K').view('i1'))
File "/Users/pelson/miniconda/envs/dev-dask/lib/python3.6/site-packages/numpy/ma/core.py", line 3148, in view
output = ndarray.view(self, dtype)
File "/Users/pelson/miniconda/envs/dev-dask/lib/python3.6/site-packages/numpy/ma/core.py", line 3425, in __setattr__
self._mask.shape = self.shape
ValueError: cannot reshape array of size 3 into shape (24,)
>>> da.ma.masked_outside(np.array([1, 2, 3]), 2, 2.5)
Traceback (most recent call last):
File "/Users/pelson/dev/dask/dask/array/ma.py", line 123, in masked_outside
return map_blocks(np.ma.masked_outside, x, v1, v2)
File "/Users/pelson/dev/dask/dask/array/core.py", line 662, in map_blocks
out_ind = tuple(range(max(a.ndim for a in arrs)))[::-1]
ValueError: max() arg is an empty sequence
>>> da.ma.masked_where([1, 2, 3], da.arange(3, chunks=(2, )))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-2-4270623ca151> in <module>()
4 import dask.array as da
5
----> 6 da.ma.masked_where([1, 2, 3], da.arange(3, chunks=(2, )))
/Users/pelson/dev/dask/dask/array/ma.py in masked_where(condition, a)
132 raise IndexError("Inconsistant shape between the condition and the "
133 "input (got %s and %s)" % (cshape, a.shape))
--> 134 return map_blocks(np.ma.masked_where, condition, a)
135
136
/Users/pelson/dev/dask/dask/array/core.py in map_blocks(func, *args, **kwargs)
690 else:
691 kwargs2 = kwargs
--> 692 dtype = apply_infer_dtype(func, args, kwargs2, 'map_blocks')
693
694 if len(arrs) == 1:
/Users/pelson/dev/dask/dask/array/core.py in apply_infer_dtype(func, args, kwargs, funcname, suggest_dtype)
525 msg = None
526 if msg is not None:
--> 527 raise ValueError(msg)
528 return o.dtype
529
ValueError: `dtype` inference failed in `map_blocks`.
Please specify the dtype explicitly using the `dtype` kwarg.
Original error is below:
------------------------
IndexError('Inconsistant shape between the condition and the input (got (3,) and (1,))',)
Traceback:
---------
File "/Users/pelson/dev/dask/dask/array/core.py", line 510, in apply_infer_dtype
o = func(*args, **kwargs)
File "/Users/pelson/miniconda/lib/python3.5/site-packages/numpy/ma/core.py", line 1910, in masked_where
" (got %s and %s)" % (cshape, ashape))
(I guess this is because the mask should be chunked in the same way as the array)
I was somewhat concerned about the numerical accuracy of things like standard deviation - it would be easy to lose precision with the repeated addition of result chunks. Biggus goes to some length to implement a streaming single-pass standard deviation for both masked and un-masked data, but I don't have an example that justifies that implementation vs what has been done here.
I'm really pleased to say that a cursory experimentation suggests that the accuracy is very good. My code:
np.random.seed(0)
rand = np.random.randn(100000) ** 20
# ~4% masked numbers
data = np.ma.masked_outside(rand, -2, 2)
chunks = (rand.size // 500,)
da_data = da.ma.masked_where(da.from_array(data.mask, chunks=chunks),
da.from_array(rand, chunks=chunks))
print(biggus.std(data, ddof=2, axis=0).masked_array())
print(np.std(da_data, ddof=2).compute())
print(np.std(da_data.compute(), ddof=2))
which gives:
0.28873287828340444
0.288732878283
0.288732878283
I also visualised the graph with 5 equal chunks:

(task graph image omitted)

This, just like the biggus implementation, has a (necessary) bottleneck at the sqrt, but the preparation before that point parallelises well (just as it does with a plain dask.array).
In summary
This is a really great implementation - I've pointed out a few usability issues, but in principle I believe the implementation is a viable option for entirely bridging the remaining functionality gap between biggus and dask. With some refinement of this implementation I'm 100% supportive of moving towards formally deprecating biggus in favour of dask (note: I'm a core biggus dev).
```python
def _cumsum_merge(a, b):
    if isinstance(a, np.ma.masked_array) or isinstance(b, np.ma.masked_array):
        values = np.ma.getdata(a) + np.ma.getdata(b)
        return np.ma.masked_array(values, mask=np.ma.getmaskarray(b))
```
(Note: I'm reading this from a high level, so haven't fully understood its purpose)
Shouldn't this be a combination of a and b's masks?
No, for cumulative operations the mask stays fixed throughout and the operation only occurs on the data. In this case b is whatever chunk already existed at that chunk-location, and a is the result of cumsum/cumprod over all previous chunks along the axis.
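A runnable numpy-only sketch of the merge rule described above (the helper name mirrors the diff; the chunking here is illustrative):

```python
import numpy as np

def cumsum_merge(a, b):
    # Merge rule from the diff above: add the raw data,
    # keep the mask of b (the chunk at this location)
    if isinstance(a, np.ma.MaskedArray) or isinstance(b, np.ma.MaskedArray):
        values = np.ma.getdata(a) + np.ma.getdata(b)
        return np.ma.masked_array(values, mask=np.ma.getmaskarray(b))
    return a + b

# Two chunks of a 1-d cumsum; masked entries contribute 0 to the data
x = np.ma.masked_array([1, 2, 3, 4], mask=[0, 1, 0, 0])
chunk1, chunk2 = x[:2], x[2:]
carry = chunk1.filled(0).sum()            # running total of earlier chunks
merged = cumsum_merge(carry, chunk2.cumsum())
expected = x.cumsum()
assert (merged == expected[2:]).all()     # matches the single-pass cumsum
```

Note how the result's mask comes entirely from `chunk2`: the carried-in total `a` only adjusts the data, exactly as described.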
```python
def divide(a, b, dtype=None):
    key = lambda x: getattr(x, '__array_priority__', float('-inf'))
    f = divide_lookup.dispatch(type(builtins.max(a, b, key=key)))
```
I'm surprised there isn't use of the __array_priority__ dispatching elsewhere (maybe there is). It seems like a fairly common requirement of Dispatch...
Why is divide different from other ufuncs?
We need to use np.ma.divide for masked arrays, and our patched version of np.divide for non-masked arrays. If we use np.divide for masked arrays, then the function applies to all elements in .data, which may result in RuntimeWarnings due to divide-by-zero. Note that this is only true for the functional operators; using the operators instead dispatches properly (e.g. divide vs /).
This is related to my question about implementing all the masked ufuncs. Normal ufuncs work and return instances of np.ma.masked_array, but apply to all elements in .data instead of only the non-masked ones. Note that even if we did implement masked ufuncs, we'd still need to do this dispatch here in the reductions code.
@jcrist - great stuff! 👍
Did you try with
Thanks for the review @pelson. Responding to a few comments:

The errors you're seeing in

One question is what do you expect to happen if you call a

The
Thoughts?
We also work hard to implement numerically stable parallel algorithms. The one used here is from http://prod.sandia.gov/techlib/access-control.cgi/2008/086212.pdf and has been working well for us. It's used for computing all our moments (
Just to clarify, depending on your chunking and reduction axis there may be multiple parallel calls to
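For reference, a sketch of the pairwise update formulas used by this family of algorithms (Chan et al. / Pébay); this is an illustration of the technique, not dask's actual implementation:

```python
import numpy as np

def chunk_stats(x):
    # Per-chunk partials: count, mean, and M2 (sum of squared deviations)
    x = np.asarray(x, dtype=float)
    n = x.size
    mu = x.mean()
    m2 = ((x - mu) ** 2).sum()
    return n, mu, m2

def merge_stats(s1, s2):
    # Pairwise combination: numerically stable because it never forms
    # a raw sum of squares, only deviations from running means
    n1, mu1, m21 = s1
    n2, mu2, m22 = s2
    n = n1 + n2
    delta = mu2 - mu1
    mu = mu1 + delta * n2 / n
    m2 = m21 + m22 + delta ** 2 * n1 * n2 / n
    return n, mu, m2

rng = np.random.RandomState(0)
x = rng.randn(10000)
n, mu, m2 = chunk_stats(np.split(x, 10)[0])
for c in np.split(x, 10)[1:]:
    n, mu, m2 = merge_stats((n, mu, m2), chunk_stats(c))
std = np.sqrt(m2 / (n - 2))        # ddof=2, matching the example above
assert np.allclose(std, np.std(x, ddof=2))
```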
Personally, I'd expect to always have a lazy dask thing if I've called dask functions/methods. If I wanted immediate, I'd call numpy directly...
At this moment, I'm afraid I don't have enough background to possibly give an educated answer.
Great. Biggus used a method described in
@jcrist many thanks for the updates.

Key operations I know about with masks include:
Are you interested in general problem statements like this, or are you more keen for actual examples of data processing which could form test cases? Our current implementation converts masked arrays to NaN-filled float arrays, which gives partial support, but there are edge cases which are causing concern, and a lack of capability in some cases.
Motivation, for sure; time is more of a challenge just now. I'll let you know if I make progress.
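The NaN-fill workaround mentioned above has a couple of concrete edge cases that a small numpy sketch makes visible (illustrative):

```python
import numpy as np

a = np.ma.masked_array([1, 2, 3], mask=[False, True, False])

# Edge case 1: NaN requires a float dtype, so integer data is promoted
filled = a.astype(float).filled(np.nan)
assert filled.dtype == np.float64

# Edge case 2: the value under the mask is lost in the conversion;
# np.ma retains it (a.data[1] is still 2), the NaN version does not
assert a.data[1] == 2
assert np.isnan(filled[1])
```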
I'm running into some hard-to-solve issues with masked arrays if the I'd like to not handle this case and only work with scalar
If given no direction, I plan to error explicitly where easy/possible, and forbid operations that could result in this case (e.g. forbid
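A hypothetical sketch of the "error explicitly" approach for array-valued fill_values; the helper name is invented for illustration:

```python
import numpy as np

def check_scalar_fill_value(x):
    # Hypothetical guard matching the plan above: only scalar
    # fill_values are supported, array-valued ones are rejected
    fv = getattr(x, 'fill_value', None)
    if fv is not None and np.ndim(fv) != 0:
        raise ValueError("Array-valued fill_value is not supported: %r" % (fv,))
    return x

a = np.ma.masked_array([1.0, 2.0], mask=[False, True], fill_value=-1.0)
check_scalar_fill_value(a)   # scalar fill_value: passes through
```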
Bugs in implementation of masked arrays before then make this difficult.
Move some masked array specific functions to general dispatches
- Add support for non-dask objects in masked operations - Properly handle non-equal chunking by using elemwise where required
Failures after merging new elemwise dtype inference code.
- Add api docs - Update sparse docs on arbitrary chunk types
Fine by me. I've fixed the merge conflicts and updated some docs. I think this is good to go now.
+1 from me
Alright, this is in. Thanks everyone for reviewing/testing this PR.
- Add support for masked arrays as chunks - Add api docs for mask arrays - Update sparse docs on arbitrary chunk types
Adds support for masked arrays, similar to how we support sparse arrays.
Supports:
- `da.ma.masked_greater`
- `da.ma.getmaskarray`
- `da.ma.set_fill_value`

The last one is a little weird because it matches the numpy api of mutating the array instead of returning a new one. I figured it was better to match numpy here than to match dask, but could go either way.
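For reference, the numpy behaviour this mirrors: `np.ma.set_fill_value` mutates its argument in place rather than returning a new array.

```python
import numpy as np

a = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
np.ma.set_fill_value(a, 999)   # mutates a; returns None
assert a.fill_value == 999
assert a.filled()[1] == 999    # masked slot now fills with 999
```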
Fixes #1928.
Note that this adds a type-level registry for package-level functions like `concatenate` and `tensordot` instead of using the `package_of` function from before. This is necessary because:

- `package_of` returns `np` for masked arrays
- `np.ma.concatenate` doesn't persist the `fill_value` of its inputs

I also think this is cleaner than relying on inspecting the object to find its module.