Better memory overlap detection #6166
Conversation
Correction: overlap detection is used in …
This also seems to be related to reshaping arrays. Contiguous arrays, for instance, can be reshaped to 1-D. Arrays …
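A quick illustration of that contiguity point (my sketch, not code from the thread):

```python
import numpy as np

x = np.arange(12).reshape(3, 4)        # C-contiguous
flat = x.reshape(-1)                   # a view: no data is copied
print(np.may_share_memory(x, flat))    # True

t = x.T                                # same data, but not contiguous
flat_t = t.reshape(-1)                 # reshape must copy here
print(np.may_share_memory(t, flat_t))  # False
```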
The failures mostly look build-related, although there might be some valid issues turned up in the …
The parameters section of the docstring says …

```python
>>> np.may_share_memory(x[0,:], x[1,:])
True
>>> np.may_share_memory(x[0,:], x[1,:], max_work=0)
False
```
Fixed the build/doc issues. The three build systems are a bit of a PITA... Will squash when everything else is finished.
numpy/core/tests/test_mem_overlap.py
Trivial style thing -- let's instantiate an RNG object instead of using the global RNG?
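As a sketch of the suggested style change (hypothetical test snippet, not the actual diff in the PR):

```python
import numpy as np

# Instead of seeding the shared global RNG:
#     np.random.seed(1234)
#     x = np.random.randint(1, 50, size=10)

# ...give the test its own RNG instance, so it cannot affect
# (or be affected by) other tests that use np.random:
rng = np.random.RandomState(1234)
x = rng.randint(1, 50, size=10)
```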
This is totally amazing, @pv! I feel like there are two reasonable ways we could deploy this: …

[1] yes, I feel like the decision comes down to: given the distributions of arrays people are using in practice, can we choose a …

Does anyone have any ideas for how to go about figuring out the answer to the above question? The analysis of special cases in the big comment in the source makes me hopeful, but not 100% confident on its own. And I'm nervous about advertising to users that they can start relying on things like …

I guess one option would be to make a release that just prints out a big message every time it encounters some arrays where it can't answer the question with …
I'd also like to see tests that explicitly check that various realistic cases actually are handled when the work bound is small. (Or maybe those are there already, but I didn't see any on a quick skim.)

One more thought: maybe the solution to the problem of wanting to support "power users" is to just add a ufunc kwarg …
Should this even be possible? Numpy doesn't offer an easy way to turn off the bounds check, while that one is cheaper and safer (omitting a check leads to behavior that doesn't depend on implementation details).
I really like this. My guess is, we should likely start off with a warning (or possibly allow …).

Two things that must be checked after things (may) share memory (thinking about ufuncs here): …

We could silently copy in the longer run as a service for users who do not control the function input. I doubt a keyword is of much use. As far as I can think of right now, the 1-d case is probably the only one where you do not have to read the nditer docs/code to be sure for all cases.
@seberg what's wrong with always silently giving the correct result? It wouldn't break existing code that accidentally works.
About deployment: also …
@akhmerov, I don't mind a silent copy in almost all cases, but as Nathaniel says, some users might like a warning about the silent copy (though I doubt this is important). What I mostly wanted to warn about is the 1-d overlapping in-place ufunc, which gives reliable results that are not equivalent to a copy. I do believe that for all other cases the result is not very reliable, but that does not mean that nobody relies on it anyway. I would not be surprised if someone makes use of it, so giving a warning first seems the safer bet. However, the only thing I feel strongly about is the 1-d case, which I have done myself, and I am sure many, many people have too (I would be surprised if SciPy does not have examples of it), and converting working code into broken code is dangerous.
Well, exactly :-). Bounds checks are quick and reliable, so we can just do them all the time and not worry; this is something that might take a lot of time, or -- if we prevent that by enforcing a timeout -- might return unreliable results that trigger undesired behavior (like unnecessary copies or warnings).

@seberg: I guess one migration strategy that would address your concern and also my concerns above would be to initially, whenever we detect illegal overlap and think that copying would be a good idea, simply raise an error instead. This is reasonable, since all the illegal overlap cases are (by definition) ones where we have never guaranteed any particular behaviour, so it would both give people who are depending on this a good chance to catch their problems, and at the same time it would let us safely smoke out whether there are any problems with our overlap detection, without committing to providing any particularly useful semantics. Then, once we are confident that it works and we can support it, we could replace the errors by silent copies. (Or maybe even decide that explicit is better than implicit and leave the errors, while telling people to make their own copies if that's what they want...)

And hmm, right, I hadn't thought about which cases (g)ufuncs should allow overlap; that's important to get down precisely :-). I guess for scalar (non-generalized) ufuncs, the rule is that out and in arrays are allowed to overlap iff they are identical? And for gufuncs, overlap is never legal (but as a special-case optimization, if out and in are identical, then instead of copying the whole input at once you might instead buffer each input block as it is processed).
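To make those rules concrete, a small illustration of the two cases for a scalar ufunc (my sketch, not code from the thread; the second call was undefined behavior at the time of this PR):

```python
import numpy as np

a = np.arange(10.0)

# in and out identical: allowed under the proposed rule
np.add(a, 1.0, out=a)

# in and out partially overlapping: never guaranteed, so under the
# migration strategy above this would initially raise an error
np.add(a[:-1], 1.0, out=a[1:])
```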
About the work limit: is there a case of two arrays that fit into RAM where the amount of work is anywhere comparable to, e.g., the cost of applying a unary ufunc to the smaller of them? If I'm not mistaken, this problem is EXPSPACE-hard for large numbers of dimensions.
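For reference, the trade-off discussed here is exactly what the API that came out of this PR exposes (a sketch against NumPy >= 1.11):

```python
import numpy as np

x = np.zeros((3, 4))
a, b = x[:, 0], x[:, 1]   # interleaved columns: memory bounds overlap,
                          # but no element is shared

print(np.may_share_memory(a, b))  # True  -- cheap bounds-only heuristic
print(np.shares_memory(a, b))     # False -- exact answer from the solver
```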
I note that code relying on non-copy semantics is brittle also in 1-D cases …
@pv, oh wow, how the heck did that change? It is not changed in 1.8. Frankly, if 1.9 were not so long ago, I would think this is a regression; I was quite certain that in the 1-d case the ufunc sets the "do not reverse strides" flag for nditer (I had to add that for advanced indexing). My guess is that it should be well defined (whatever that means) for all arrays that are obviously best iterated in C-order; the simple "different offset" 1-d case is just the only easily reliable one I can think of. Maybe I am just paranoid because I used to write the …
I think having `a += b` and `a = a + b` produce results equivalent modulo copying would be the best thing to do.

The argument so far for not doing this is to not break existing code. However, we have silently broken such non-copy semantics several times in previous releases, and AFAICS only the 1D unbuffered positive-stride case is unchanged.

It should be straightforward to monkeypatch numpy ufuncs+binops so that such undefined behavior can be detected. I would guess such tricks are uncommon, and e.g. there are none in Scipy. However, this is easily checked.
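A toy sketch of such a monkeypatch (mine, not the tool pv links below; it only intercepts explicit np.add calls, and it uses the np.shares_memory check this PR introduces):

```python
import warnings
import numpy as np

_orig_add = np.add

def checked_add(a, b, out=None, **kwargs):
    # Warn when the output overlaps an input without being identical to
    # it -- the case whose behavior has historically been undefined.
    if isinstance(out, np.ndarray):
        for operand in (a, b):
            if (isinstance(operand, np.ndarray) and operand is not out
                    and np.shares_memory(operand, out)):
                warnings.warn("np.add called with overlapping in/out arrays")
    return _orig_add(a, b, out=out, **kwargs)

np.add = checked_add
```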
I don't mind the copy in the long run, but I am still a bit wary of changing working code. If scipy does not do this at all, I guess that is a good hint that we can get away with it though. I have tried to bisect when this changed, but did not quite manage. It all points to somewhere in my indexing changes, but there is no weird indexing going on here! You get the same result with: …
:( it is my fault that this changed. I changed the speed special case because I misunderstood the logic during the indexing stuff:

```diff
--- a/numpy/core/src/private/lowlevel_strided_loops.h
+++ b/numpy/core/src/private/lowlevel_strided_loops.h
@@ -668,7 +668,7 @@ npy_bswap8_unaligned(char * x)
                     PyArray_DIMS(arr2), \
                     PyArray_NDIM(arr1)) && \
     (PyArray_FLAGS(arr1)&(NPY_ARRAY_C_CONTIGUOUS| \
-                          NPY_ARRAY_F_CONTIGUOUS)) & \
+                          NPY_ARRAY_F_CONTIGUOUS)) == \
     (PyArray_FLAGS(arr2)&(NPY_ARRAY_C_CONTIGUOUS| \
                           NPY_ARRAY_F_CONTIGUOUS)) \
    )
```

This does not 100% convince me that it is OK to screw up the 1-D in-place stuff, but at least the fact that nobody noticed might make it more reasonable. But the question is whether just having a warning for a bit isn't a safer bet. Sorry for hijacking the thread a bit. It seems the fast special-case path is a bit weird anyway, since two 1-d arrays do not hit it if one is not contiguous and the other is, I think.
Here's a tool for checking at runtime if some code assumes non-copy semantics (however, it detects issues only if the data is such that the results from the two operations differ), or calls ufuncs with overlapping in/out arrays: https://gist.github.com/pv/d1ecc8784c9d6a2a92bf

The outcome is that no ufuncs are called on arrays whose memory bounds overlap (max_work=0) in the Scipy or Pandas test suites (although it did find a fixed bug; maybe you can guess what that was). The exceptions were the trivial …
I think this is really impressive.
Hmm, … Anything that finds bugs is good ;)
OK, btw, I retract all my doubts about 1-d. I had not realized that the SIMD optimizations (I suppose) messed this kind of trick up long ago for the most obvious "use cases".
Ok, indeed,

```python
x = np.arange(16).astype(float); x[1:] += x[:-1]
```

produces different results on 1.6.2 and 1.9.2.
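For concreteness, the two candidate semantics differ like this (a sketch; which one a given NumPy version produced depended on the iterator internals):

```python
import numpy as np

x = np.arange(16).astype(float)

# Copy semantics: as if the right-hand side were evaluated first,
# i.e. equivalent to x[1:] = x[1:] + x[:-1]
copy_result = x[1:] + x[:-1]        # [ 1.  3.  5.  7. ...]

# Propagating in-place semantics: each update sees the already-updated
# previous element, turning the operation into a cumulative sum
prop = x.copy()
for i in range(1, len(prop)):
    prop[i] += prop[i - 1]          # prop[1:] is [ 1.  3.  6. 10. ...]
```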
@charris Years ago I rewrote reshape to use views in the vast majority of cases where it actually works. There are a few corner cases, but not many. So I don't think that this is needed, necessarily.
nditer also has such reshape logic in place to optimize iteration (with axes reordering, unlike reshape); not sure whether it handles all the corner cases. It probably makes sense to add the "copy on overlap" logic to nditer itself. On the other hand, the "fast paths" also need the check (but maybe they could just drop through to nditer if this happens).
The nditer changes should be a separate PR, probably not so trivial... This one may be getting close to a conclusion of some sort: …
Added an internal overlap solver. I'm not sure if we actually want to use it for something in the end, but it's there.
@pv, thinking about it, the ufunc machinery might be doing "illegal" things when operands are identical. That would explain why it seems to half-work, but garbage data gets written to the output. Your changes could likely fix that, so I will try it in the next few days.
@seberg: ufunc_object.c:iterator_loop has this: …
Anyway, the fact that changing this sort of thing doesn't require more work probably says good things about the structuring of the iterator/ufunc codebase.
@pv yeah, thanks! I was thinking this morning it was likely some ufunc stuff that made it break so badly. Yes, it seems pretty straightforward, though I am sure there are some details still to figure out. Anyway, I think we should put this in whenever it is ready; it should be used in various places soon enough (we could create an issue to list some of the places where that would be nice, e.g. the iterator, dot, indexing and probably some other smaller things). Actually, if the arrays are small, copying is faster than buffering I think, so only the opposite could probably make sense, and that is not possible I think.
I think this PR is fairly feature-complete now. Better algorithms could be added later on, but they are probably not really necessary, given that the "easy" case covers the real-life use cases seen so far.
@pv Could you add a note to the 1.11.0 release notes? Under Improvements looks like a good place. |
These are used for testing the Diophantine solver and memory overlap codes.
Rebased.
Let's give this a shot. Thanks Pauli.
Sorry, I didn't follow this all: how relevant is the checked integer arithmetic for performance? If relevant, some of it could be optimized with compiler intrinsics, like the currently already existing npy_mul_with_overflow.
I didn't try to benchmark it without overflow checking, so unclear. The typical runtime cost is, however, probably not big compared to the other stuff going on in ufunc dispatch preparation.
Here's an initial stab at (i) an exact solution for whether two arrays overlap in memory, and (ii) using it in array-assignment overlap detection.
As the problem presumably is NP-hard, the algorithm used for (i) scales exponentially with the number of dimensions of the arrays. By limiting the depth of the search, one obtains an improved `may_share_memory` that can deal with several common cases for strided arrays. I use the limited version with a fixed search size to get a (fairly OK) approximate answer.

Currently, overlap detection in Numpy appears to be used for making temporary copies for array assignment.
First step at addressing gh-6119, gh-1683 etc.
The next step would be to reuse the logic by which we make copies for assignment to also make temporary copies of ufunc input arguments (and maybe updateifcopy for output arguments).
This still needs somewhat more tests, and `max_work` should be tuned appropriately based on benchmarks. Perhaps it needs to depend on the size of the arrays, to balance the cost of making copies against the cost of solving the integer program.

I'm aware there are several other algorithms for solving Diophantine equations in the literature, as the problem apparently also crops up in automated theorem proving. However, the present one is fairly simple, low on memory requirements, and its performance seems OK, so I'd prefer to just roll with it.
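For intuition about the underlying problem: two strided views alias iff the linear Diophantine equation base_a + sum_i x_i*stride_a[i] == base_b + sum_j y_j*stride_b[j] has a solution with every index inside its shape bounds, which is why the exact check scales exponentially with dimensionality. A brute-force reference check (my illustration, not the PR's solver; it assumes equal itemsizes, so comparing element start addresses suffices):

```python
import itertools
import numpy as np

def overlaps_bruteforce(a, b):
    """Exponential-time reference: do any elements of a and b coincide?"""
    def addresses(arr):
        base = arr.__array_interface__['data'][0]
        return {base + sum(i * s for i, s in zip(idx, arr.strides))
                for idx in itertools.product(*(range(n) for n in arr.shape))}
    return bool(addresses(a) & addresses(b))

x = np.zeros((3, 4))
print(overlaps_bruteforce(x[:, 0], x[:, 1]))  # False: interleaved columns
print(overlaps_bruteforce(x[0], x.ravel()))   # True: same buffer
```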