Fixed mean() and moment() on sparse arrays #4525
Conversation
To properly create sparse auxiliary arrays, a dispatcher for ones_like has been added. However, sparse arrays created from ones_like cause sparse reductions to fail, as the reduction would result in a dense array. To fix that, the arrays must first be converted to dense, which requires a new dispatcher for that purpose; for arrays that are already dense, the conversion must be bypassed, implemented here as numpy.ndarray.view(), which simply returns a shallow copy of the array. This commit fixes dask#4523 and adds the tests suggested in that issue.
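As context for the description above, here is a minimal sketch of that dispatch pattern using `dask.utils.Dispatch`; the `todense` name and the registrations are illustrative, not necessarily the exact code in this PR:

```python
import numpy as np
from dask.utils import Dispatch

# Hypothetical dispatcher: convert a chunk to a dense numpy array.
todense = Dispatch(name="todense")

@todense.register(np.ndarray)
def _todense_numpy(x):
    # Already dense: return a shallow view rather than a copy.
    return x.view(np.ndarray)

@todense.register_lazy("sparse")
def _register_sparse():
    # Registered lazily so sparse stays an optional dependency.
    import sparse

    @todense.register(sparse.COO)
    def _todense_sparse(x):
        # Materialize the sparse chunk as a dense numpy array.
        return x.todense()
```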
---
@mrocklin please let me know what you think of this. I'm not particularly happy with the need of the `todense_lookup` dispatcher.
---
I think that we shouldn't ever have to create the original ones array. Instead we should figure out what the output looks like and create that.

```python
In [2]: from dask.array.reductions import numel

In [3]: import numpy as np

In [4]: x = np.ones((2, 3, 4))

In [5]: numel(x, axis=(2,), keepdims=False)
Out[5]:
array([[4., 4., 4.],
       [4., 4., 4.]])

In [6]: numel(x, axis=(2,), keepdims=False).shape
Out[6]: (2, 3)

In [7]: np.ones(shape=x.shape[:2]) * x.shape[2]
Out[7]:
array([[4., 4., 4.],
       [4., 4., 4.]])
```

This gets a little bit tricky for complex values of `axis` and `keepdims`.
---
And actually, going even further, we may not need to track numel at all. I think that we may be able to just track a single integer throughout all of the cases where numel is used. If so, this would be preferable. So in general I think that instead of improving numel we should take a hard look at how it is used today, and see if we can come up with a better algorithm (if you're up for that).
---
I agree that not creating the original ones array would be a better solution. However, if I'm not missing something, we still can't create a […]
---
It may be that we never need to create the array of ones at all. It may be that we're passing way more information than we need to. I think that we only need the single value stored in the array, along with the shape of the array (which happens to be the same as the shape of the other arrays currently passed in the same dictionary).
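To make the value-plus-shape idea concrete, here is a hedged sketch of a `numel()` that builds the constant-valued result directly instead of reducing an array of ones; the helper name and details are illustrative, not the implementation that eventually landed:

```python
import numpy as np

def numel_sketch(x, axis=None, keepdims=False, dtype="f8"):
    # numel() is the number of elements reduced over: the product of the
    # reduced axes' lengths, broadcast to the reduction's output shape.
    # np.full builds that directly, with no intermediate ones() array.
    if axis is None:
        axis = tuple(range(x.ndim))
    axis = tuple(dim % x.ndim for dim in axis)  # normalize negative axes
    value = np.prod([x.shape[dim] for dim in axis])
    if keepdims:
        out_shape = tuple(1 if dim in axis else n
                          for dim, n in enumerate(x.shape))
    else:
        out_shape = tuple(n for dim, n in enumerate(x.shape)
                          if dim not in axis)
    return np.full(out_shape, value, dtype=dtype)
```

For `x = np.ones((2, 3, 4))`, `numel_sketch(x, axis=(2,))` reproduces the `numel()` output shown earlier. Masked arrays still need a per-element count that excludes masked entries, which is presumably why the later commit restricts this optimization to unmasked arrays.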
---
@mrocklin I changed `numel()` so it won't create `ones()` for unmasked arrays. This still doesn't solve returning only the value and shape, but it's a start. Also, I'm sure the […]
---
And I almost forgot, but the […]
---
This looks great to me. I'm glad to see this come together. There appears to be a failure in the Python 2.7 build; it looks like behavior might have changed in one of the relevant libraries.
---
This looks great to me. Thanks @pentschev! Merging.
---
Actually, can we safely remove the `ones_like_lookup` now?
---
Yes, I thought I would leave it since it could be useful, but probably creating a […]
---
So, I think that we can avoid implementing the `todense_lookup`.
---
I hope you don't mind, but I pushed a commit to your branch that removes the `todense_lookup`.
---
That's also what I had in mind. No worries, I see you marked it xfail for now. Hopefully when a new sparse release is out we can remove it.
---
You can remove the xfail now.
---
@hameerabbasi that was fast, thanks! Didn't push yet, but now it fails when calling […]
---
That will have to wait for a day or so. Of course, you're free to make a PR and I'll review it.
---
I can send a PR, the question was more if that's something we should do, but I got my answer now. :)
dask/array/reductions.py:

```diff
-    diff = A - u
-    todense = todense_lookup.dispatch(type(diff))
-    xs = [sum(todense(A - u)**i, dtype=dtype, **kwargs) for i in range(2, order + 1)]
+    xs = [((A - u)**i).sum(dtype=dtype, **kwargs) for i in range(2, order + 1)]
```
@mrocklin this is the line that causes the nanvar test to fail. Please note that `sum()` here is not Python's nor NumPy's `sum()`, but one of `moment_chunk()`'s arguments. I'm now wondering three things:

- Should we rename the `sum` argument to something like `sum_func` to avoid confusion?
- Is it possible there's a bug in the `sum()` function in `dask/array/reductions.py`? I don't think they should give us different results anyway.
- Did we accidentally change behavior in some reduction functions when `sum()` was replaced by `_concatenate2().sum()`?
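For reference, a rough sketch of the kind of signature being discussed, showing why the parameter name is confusing; this is illustrative, not the exact dask source:

```python
import numpy as np

def moment_chunk_sketch(A, order=2, sum=np.sum, dtype="f8", **kwargs):
    # ``sum`` is whatever reduction the caller passes in: np.sum for var(),
    # np.nansum for nanvar(), and so on. It shadows Python's builtin sum.
    # Calling the ``.sum()`` method instead silently drops that choice,
    # which is consistent with only the nanvar test failing.
    u = A.mean(**kwargs)
    return [sum((A - u) ** i, dtype=dtype, **kwargs)
            for i in range(2, order + 1)]
```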
---
As @hameerabbasi pointed out, we can now remove the xfail, but for that we'll need to use upstream sparse or wait for a new release. How should we handle that?
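One hedged option for handling that is to gate the xfail on the installed sparse version, so the marker disappears automatically once a release containing the fix is installed; the version bound below is a placeholder, since no such release existed at the time:

```python
import pytest
import sparse
from packaging.version import parse

# Placeholder bound: "0.7.0" stands in for the first release containing
# the fix; with any older release the test is expected to fail.
SPARSE_HAS_FIX = parse(sparse.__version__) >= parse("0.7.0")

@pytest.mark.xfail(not SPARSE_HAS_FIX,
                   reason="needs a fix that is only in sparse master")
def test_mean_sparse():
    ...
```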
---
Failures are in test_parquet and therefore not related to this PR. @jakirkham are we good to merge this?
---
Everything passes again.
---
Earlier today, @jakirkham and I discussed this PR offline, and I understood what his intent was regarding upstream libraries. I therefore did what we agreed to try, which is to reenable the development build that uses upstream libraries.
---
Thanks @pentschev! LGTM. @jcrist and @martindurant, thoughts?
---
Am seeing a few errors coming from Sparse/Numba in the dev tests that look like this […]
---
The development builds often have periodic failures due to upstream issues that have nothing to do with us. As such, I'd prefer to either: […]
---
I've reported the issue upstream in Numba (numba/numba#3953).
---
I agree that we should find a middle ground, like @jcrist suggested. I would really like to get this PR merged soon; we're starting to stack different problems, which prevents fixes from getting merged (like the FFT MKL workaround, which is by the way already stacked here), and that isn't nice. So my proposal is the following: […]

After both items above are done, we can get this PR merged before it becomes a big snowball attempting to fix several partially unrelated problems in one. Do you agree with my proposal @jakirkham @mrocklin @jcrist @hameerabbasi?
---
Thanks @jcrist. This seems like a reasonable request to me. @pentschev, any thoughts on this? (Sorry Peter, GitHub didn't show your latest comment.) Thanks for following up on that, @hameerabbasi. I see that issue is closed. What is the takeaway?
---
It was a temporary break on Numba's end.
---
Is there a known working version of Numba? Should we pin it to something?
---
It was only in Numba master.
---
Sorry, I think I'm missing something. How would that cause the CI failure we saw? Are we getting a dev version of Numba somewhere that I missed?
---
@jakirkham I agree with @jcrist's suggestions for that particular test build. In general I tend not to like this sort of test because it's likely to get ignored/forgotten, even when it's necessary. However, I'd still like to test against sparse upstream for every PR for a while, to ensure this keeps working well. Nevertheless, what do you think of the proposal from my last comment, @jakirkham? I think we can and should think through the way we build, but I would prefer to have another PR for that, as it's likely to be an issue with diverging opinions that would hold off merging this PR.
---
I just checked, apparently […]
---
Anyway, I isolated the issue (as it was essentially a compilation issue, which is easy to isolate) and made sure it's not there in Numba […]
---
Thanks so much for doing that @hameerabbasi. One alternative we have now is to try using Numba master for […]
---
This reverts commit 1705689.
---
Since there were no objections to my proposal, I reverted the last commit so that this PR can focus on the fixes we aim to have, while moving the discussion on how to test upstream libraries to #4696. Please let me know if there are any other suggestions for the fixes here; otherwise, I think this was already good for merging a couple of weeks ago.
---
Thanks @pentschev!
---
Commits:

* Fixed mean() and moment() on sparse arrays
* Reductions' numel() won't create ones() for unmasked arrays
* Add tests for new numel() implementation
* Fix numel() test, previously failing in Python 2.7
* Remove ones_like_lookup() for sparse matrices
* Remove todense_lookup
* Call correct sum() function in moment_chunk()
* Remove xfail from mean() sparse test
* Add sparse std() test back
* Test sparse moment()
* Test sparse var()
* Build also against sparse upstream
* Fix condition for CI upstream sparse installation
* Attempt to fix upstream sparse installation once more
* Enable __array_function__ in Python 3.7 build
* Remove leftover export from run_tests.sh
* Workaround for mkl.fft failures in test_array_function.py
* Minor reductions readability/code consistency changes
* Increase coverage of numel()
* Remove unnecessary for loop in numel() test
* Reenable development build, uses upstream libraries
* Revert "Reenable development build, uses upstream libraries" (reverts commit 1705689)