Fix the average function when there is a masked array #4236

garaud · 2018-11-23T13:16:35Z

The Dask average function does not take the mask into account when the sum of weights is computed.

Add a unit test dedicated to this issue.

Tests added / passed
Passes flake8 dask

close #3846
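To illustrate the issue, here is a minimal NumPy-only sketch (not the Dask implementation; the array values are made up for the example) of how the weighted mean changes once masked positions are excluded from the sum of weights:

```python
import numpy as np

# Made-up data: the third value is masked and carries a large weight.
a = np.ma.masked_array([1.0, 2.0, 3.0, 4.0], mask=[False, False, True, False])
w = np.array([1.0, 1.0, 10.0, 2.0])

# Numerator: the masked entry is skipped by the masked sum (1 + 2 + 8 = 11).
numerator = (a * w).sum()

# Buggy normalisation: the weight of the masked entry is still counted (14).
naive = numerator / w.sum()

# Fixed normalisation: zero out the weights wherever the data is masked (4).
w_valid = w * ~np.ma.getmaskarray(a)
fixed = numerator / w_valid.sum()

print(naive, fixed)  # about 0.786 versus 2.75
```

The fixed value agrees with `np.ma.average(a, weights=w)`, which is the reference behaviour this PR aims to match.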

mrocklin · 2018-11-23T13:25:21Z

cc @dkillick in case you have time to review

DPeterK

Looks good. Thanks @garaud!

garaud · 2018-11-23T14:30:11Z

Thanks!

Travis failed on

from .ma import getmaskarray
E   ImportError: dask.array.ma requires numpy >= 1.11.2

The Python 3.5 build uses numpy 1.11.1, whereas other builds, e.g. Python 3.6, use numpy 1.14.1.

DPeterK · 2018-11-23T14:31:22Z

Travis failed

It did... I'm thinking about that right now - I'll get back to you about that soon!

DPeterK · 2018-11-23T15:05:10Z

So the failure is being caused by this import of dask.array.ma. The import causes the NumPy version difference to become a problem, which it wasn't before because dask.array.ma isn't imported anywhere else in the codebase.

I'm not quite sure what to do about this. The obvious easy fix is to increase the NumPy version in the failing test, but I'm not sure whether that version of NumPy was chosen for a good, but not immediately obvious, reason. Otherwise we need to remove the problem import line, but that too has problems: do we duplicate the getmaskarray functionality outside of dask.array.ma? Or, like NumPy, do we duplicate the average function to some extent in the masked array module?

@jcrist do you have any thoughts on this?

mrocklin · 2018-11-23T16:08:40Z

Perhaps we guard the import on this line?

    wgt = wgt * (~getmaskarray(a))

if numpy.__version__ > ...:
    from . import ma
    wgt = wgt * ~ma.getmaskarray(a)
...

If the NumPy version is lower than that, then presumably the array isn't a dask masked array anyway, so this should be safe?
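One caveat with a version guard like this: comparing version strings directly is unreliable, because strings are compared character by character. A small sketch of a numeric comparison follows; `version_tuple` and `ma_supported` are hypothetical helpers written for this example, and the 1.11.2 threshold is taken from the ImportError message in the Travis log above.

```python
def version_tuple(version):
    """Turn '1.11.2' into (1, 11, 2).

    Hypothetical helper: only the leading digits of the first three
    components are kept, so suffixes such as 'rc1' are ignored.
    """
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


# Threshold quoted by the ImportError in the Travis log.
MIN_MA_VERSION = (1, 11, 2)


def ma_supported(numpy_version):
    """Return True if this NumPy version satisfies the 1.11.2 minimum."""
    return version_tuple(numpy_version) >= MIN_MA_VERSION


# String comparison gets this wrong: '9' sorts after '1'.
print("1.9.0" > "1.11.2")      # True
print(ma_supported("1.9.0"))   # False
print(ma_supported("1.11.1"))  # False
print(ma_supported("1.14.1"))  # True
```

In a real guard one would pass `numpy.__version__` to such a helper, or use an existing version-parsing utility instead.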

garaud · 2018-11-23T17:04:58Z

I think that duplicating code is not a good idea.

In that case, I'd be pragmatic: you can set the numpy version to 1.11.2 here https://github.com/dask/dask/blob/master/continuous_integration/travis/install.sh#L68 instead of 1.11. The last digit is for bug fixes, so this change should have a tiny impact.

For the other Python versions (2.7, 3.6 and 3.7), the version of numpy is > 1.11.2. We can do the same for py35. I think we can try numpy >= 1.11.2 in the Travis build and see what happens.

If numpy is pinned to 1.11 for some good reason and we don't want to upgrade to 1.11.2, I'll update the PR with @mrocklin's suggestion.

mrocklin · 2018-11-23T17:06:23Z

I don't personally know why the minimum version is set the way it is. @jakirkham might have thoughts.

DPeterK · 2018-11-23T17:12:14Z

I think that duplicating code is not a good idea.

I agree - the only reason I suggested it at all is because that's what NumPy does, I think (ma has a separate average function).

garaud · 2018-11-23T18:14:01Z

If the numpy version is updated, we also have to update the setup.py here https://github.com/dask/dask/blob/master/setup.py#L11

This change implies dropping support for numpy < 1.11.2. I'm going to create a py3.5 env with this numpy version constraint and run pytest. I hope there is a compatible combination of versions across all dependencies.

garaud · 2018-11-23T22:09:56Z

All tests passed with:

python 3.5
numpy 1.11.3
pandas 0.19.2

garaud · 2018-11-26T07:01:38Z

Curious. For unknown reasons, some tests that call da.random.random failed, as did some that use sparse.COO.from_numpy. I really don't know why.

jcrist · 2018-11-26T14:14:29Z

dask/array/routines.py

            wgt = wgt.swapaxes(-1, axis)
-
+        # if there is a masked array
+        wgt = wgt * (~getmaskarray(a))


Rather than doing this here, I'd rather that a thin wrapper around the existing functionality was implemented in the da.ma module. numpy's average function doesn't work for masked arrays, but np.ma.average does.

The best way to handle this might be to extract this full method into a helper method that takes an additional is_masked boolean (or something), and optionally applies this line.

Suggested change

-        wgt = wgt * (~getmaskarray(a))
+        if is_masked:
+            from .ma import getmaskarray
+            wgt = wgt * (~getmaskarray(a))

Then you'd define da.average with is_masked set to False, and da.ma.average with is_masked set to True.

This would also fix the numpy version import issue, as this line would only execute if called from within da.ma.average.
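The shape of this suggestion can be sketched with plain NumPy. This is a simplified illustration, not the code that was merged: it assumes the weights broadcast to the shape of `a`, it omits options such as `returned`, and `ma_average` is a stand-in name for what would be `da.ma.average`.

```python
import numpy as np


def _average(a, axis=None, weights=None, is_masked=False):
    """Shared helper (simplified sketch): weights must broadcast to a.shape."""
    a = np.asanyarray(a)  # keeps MaskedArray instances as they are
    if weights is None:
        return a.mean(axis=axis)
    wgt = np.broadcast_to(np.asarray(weights, dtype=float), a.shape)
    if is_masked:
        # Deferred import: only evaluated on the masked code path, so the
        # unmasked path never touches the masked-array module.
        from numpy.ma import getmaskarray
        wgt = wgt * ~getmaskarray(a)
    return (a * wgt).sum(axis=axis) / wgt.sum(axis=axis)


def average(a, axis=None, weights=None):
    """Thin wrapper for regular arrays."""
    return _average(a, axis=axis, weights=weights, is_masked=False)


def ma_average(a, axis=None, weights=None):
    """Thin wrapper for masked arrays."""
    return _average(a, axis=axis, weights=weights, is_masked=True)


data = np.ma.masked_array([1.0, 2.0, 3.0, 4.0], mask=[False, False, True, False])
print(ma_average(data, weights=[1.0, 1.0, 10.0, 2.0]))  # 2.75
```

Keeping the mask handling behind a flag means both public functions share one implementation, which addresses the code-duplication concern raised earlier in the thread.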

Updated with your suggestion.

In the review process, I don't know whether it's you or me who clicks the "Resolve conversation" button. I've left it as it is for now.

jakirkham · 2018-11-26T17:38:06Z

I don't personally know why the minimum version is set the way it is. @jakirkham might have thoughts.

This was probably just easy to do at the time. If bumping the patch version helps, I'd recommend doing that.

mrocklin · 2018-11-26T17:58:58Z

Sounds good to me

garaud · 2018-11-28T08:39:06Z

Thanks for your suggestion @jakirkham ! I'll update it this week.

garaud · 2018-11-28T22:02:35Z

I moved the test dedicated to the average computation with masked arrays from test_routines.py to test_ma.py.
The function da.average calls the _average wrapper with is_masked=False.
The function da.ma.average calls the same wrapper with is_masked=True.

I removed two commits from my branch: one about the numpy version and one about the authors.md file, which was removed.

DPeterK

This looks good. I like the implementation here with the thin wrapper 👍

mrocklin · 2018-11-29T12:46:48Z

@jcrist is there anything left here that you think needs to be handled?

jcrist · 2018-11-29T17:51:29Z

Nope, looks good to me. Thanks!

DPeterK approved these changes Nov 23, 2018

jcrist requested changes Nov 26, 2018

Damien Garaud added 2 commits November 28, 2018 22:31

add a unit test about the average function and masked arrays

24cdce4

The Dask average function does not take into account the masked when the sum of weights is computed.

Fix the average function when there is a masked array

cfd2e01

Sum of weights takes into account the masked array if exists. close dask#3846

garaud force-pushed the fix-average-masked-weights branch from e991e7c to cfd2e01 Compare November 28, 2018 21:31

Damien Garaud added 2 commits November 28, 2018 22:50

Move the average with masked array test to test_ma.py

bdfde84

Implement the average with masked array in the dask.ma module

0f50c76

add the _average wrapper which shared the average implementation between da.average and da.ma.average

DPeterK approved these changes Nov 29, 2018

jcrist approved these changes Nov 29, 2018

jcrist merged commit 1cff482 into dask:master Nov 29, 2018

garaud deleted the fix-average-masked-weights branch November 29, 2018 19:23

rcomer mentioned this pull request Oct 7, 2022

Lazy weighted RMS calculation SciTools/iris#5017

Merged
