Skip to content

Conversation

@garaud
Copy link
Contributor

@garaud garaud commented Nov 23, 2018

The Dask average function does not take into account the masked when the sum of weights is computed.

Add a unit test dedicated to this issue.

  • Tests added / passed
  • Passes flake8 dask

close #3846

@mrocklin
Copy link
Member

cc @dkillick in case you have time to review

Copy link

@DPeterK DPeterK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks @garaud!

@garaud
Copy link
Contributor Author

garaud commented Nov 23, 2018

Thanks!

Travis failed on

from .ma import getmaskarray
E   ImportError: dask.array.ma requires numpy >= 1.11.2

The Python 3.5 build uses numpy 1.11.1 whereas the other build, e.g. 3.6 uses numpy 1.14.1

@DPeterK
Copy link

DPeterK commented Nov 23, 2018

Travis failed

It did... I'm thinking about that right now - I'll get back to you about that soon!

@DPeterK
Copy link

DPeterK commented Nov 23, 2018

So the failure is being caused by this import of dask.array.ma. The import causes the NumPy version difference to become a problem, which it wasn't before because dask.array.ma isn't imported anywhere else in the codebase.

I'm not quite sure what to do about this. The obvious easy fix is to increase the NumPy version in the failing test, but I'm not sure whether that version of NumPy was chosen for a good, but not immediately obvious, reason. Otherwise we need to remove the problem import line, but that too has problems: do we duplicate the getmaskarray functionality outside of dask.array.ma? Or, like NumPy, do we duplicate the average function to some extent in the masked array module?

@jcrist do you have any thoughts on this?

@mrocklin
Copy link
Member

Perhaps we guard the import on this line?

    wgt = wgt * (~getmaskarray(a))
if numpy.__version__ > ...;
    from . import ma
    wgt = wgt * ~ma.getmaskarray(a)
...

If the Numpy version number is less than that version then presumably the array isn't a dask masked array anyway, so this should be safe?

@garaud
Copy link
Contributor Author

garaud commented Nov 23, 2018

I think that code duplicate is not a good idea.

In that case, I think pragmatic: you can set the minor numpy version to 1.11.2 here https://github.com/dask/dask/blob/master/continuous_integration/travis/install.sh#L68 instead of 1.11 The latest digit is for bug fixes, this change should have a tiny impact.

For the other Python versions: 2.7, 3.6 and 3.7, the version of numpy is > 1.11.2. We can do the same for py35. I think we can try a numpy >= 1.11.2 in the travis build and let's see what's happened.

If the minor version of numpy is 1.11 for some good reasons and we don't want to upgrade to the minor version 1.11.2, I'll update the PR with the suggestion of @mrocklin

@mrocklin
Copy link
Member

mrocklin commented Nov 23, 2018 via email

@DPeterK
Copy link

DPeterK commented Nov 23, 2018

I think that code duplicate is not a good idea.

I agree - the only reason I suggested it at all is because that's what NumPy does, I think (ma has a separate average function).

@garaud
Copy link
Contributor Author

garaud commented Nov 23, 2018

If the numpy version is updated, we also have to update the setup.py here https://github.com/dask/dask/blob/master/setup.py#L11

This change implies removing the support for numpy < 1.11.2. I'm going to create a py3.5 env with this numpy version constraint and launch pytest. I hope there is a well-fitted versions combination between all dependencies.

@garaud
Copy link
Contributor Author

garaud commented Nov 23, 2018

All tests passed with:

  • python 3.5
  • numpy 1.11.3
  • pandas 0.19.2

@garaud
Copy link
Contributor Author

garaud commented Nov 26, 2018

Curious. For unknown reasons, some tests where there are some da.random.random calls failed. Or from sparse.COO.from_numpy functions. I really don't know.

wgt = wgt.swapaxes(-1, axis)

# if there is a masked array
wgt = wgt * (~getmaskarray(a))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than doing this here, I'd rather that a thin wrapper around the existing functionality was implemented in the da.ma module. numpy's average function doesn't work for masked arrays, but np.ma.average does.

The best way to handle this might be to extract this full method into a helper method that takes an additional is_masked boolean (or something), and optionally applies this line.

Suggested change
wgt = wgt * (~getmaskarray(a))
if is_masked:
from .ma import getmaskarray
wgt = wgt * (~getmaskarray(a))

Then you'd define da.average with is_masked set to False, and da.ma.average with is_masked set to True.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would also fix the numpy version import issue, as this line would only execute if called from within da.ma.average.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated with your suggestion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the review process, I don't know if it's me or you which click on the "Resolve conversation" button. I left it as it is for now.

@jakirkham
Copy link
Member

I don't personally know why the minimum version is set the way it is. @jakirkham might have thoughts.

This was probably just easy to do at the time. If bumping the patch version helps, I'd recommend doing that.

@mrocklin
Copy link
Member

mrocklin commented Nov 26, 2018 via email

@garaud
Copy link
Contributor Author

garaud commented Nov 28, 2018

Thanks for your suggestion @jakirkham ! I'll update it this week.

Damien Garaud added 2 commits November 28, 2018 22:31
The Dask average function does not take into account the masked when the sum of weights is computed.
Sum of weights takes into account the masked array if exists.

close dask#3846
@garaud garaud force-pushed the fix-average-masked-weights branch from e991e7c to cfd2e01 Compare November 28, 2018 21:31
Damien Garaud added 2 commits November 28, 2018 22:50
add the _average wrapper which shared the average implementation
between da.average and da.ma.average
@garaud
Copy link
Contributor Author

garaud commented Nov 28, 2018

  • I moved the test dedicated to the average computation with masked array from test_routines.py to test_ma.py
  • The function da.average calls the _average wrapper with is_masked=False
  • The function da.ma.average calls the same wrapper with is_masked=True

I remove two commits from my branch about the numpy version and the authors.md file which was removed.

Copy link

@DPeterK DPeterK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. I like the implementation here with the thin wrapper 👍

@mrocklin
Copy link
Member

@jcrist is there anything left here that you think needs to be handled?

@jcrist
Copy link
Member

jcrist commented Nov 29, 2018

Nope, looks good to me. Thanks!

@jcrist jcrist merged commit 1cff482 into dask:master Nov 29, 2018
@garaud garaud deleted the fix-average-masked-weights branch November 29, 2018 19:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Weighted mean of masked dask array does not apply mask to weights

5 participants