Conversation

@pv (Member) commented Aug 5, 2015

Here's an initial stab at (i) exact solution for whether two arrays overlap in memory, and (ii) using it in array assignment overlap detection.

As the problem is presumably NP-hard, the algorithm used for (i) scales exponentially with the number of dimensions of the arrays. By limiting the depth of the search, one obtains an improved may_share_memory that can deal with several common cases for strided arrays. I use the limited version with a fixed search size to get a (fairly OK) approximate answer.
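For concreteness, here is a naive reference check of what "overlap" means here, enumerating the element start addresses of two same-dtype views (illustration only -- it is exponential in the number of elements and is not the algorithm used in this PR, which works on the base pointers, strides and shapes directly):

import itertools
import numpy as np

def overlap_brute_force(a, b):
    # Set of byte addresses at which elements of `arr` start.
    def addresses(arr):
        base = arr.__array_interface__['data'][0]
        return {base + sum(i * s for i, s in zip(idx, arr.strides))
                for idx in itertools.product(*map(range, arr.shape))}
    return bool(addresses(a) & addresses(b))

x = np.arange(10)
print(overlap_brute_force(x[::2], x[1::2]))   # False: interleaved views share no element
print(overlap_brute_force(x[:6], x[4:]))      # True: elements 4 and 5 are shared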

Currently, overlap detection in Numpy appears to be used for making temporary copies for array assignment.

First step at addressing gh-6119, gh-1683 etc.

The next step would be to reuse the logic by which we make copies for assignment to also make temporary copies of ufunc input arguments (and maybe use UPDATEIFCOPY for outputs).

This still needs somewhat more tests, and max_work should be tuned appropriately based on benchmarks. Perhaps it needs to depend on the size of the arrays, to balance the cost of making copies against the cost of solving the integer program.

I'm aware there are several other algorithms for solving Diophantine equations in the literature, as the problem apparently also crops up in automatic theorem proving. However, the present one is fairly simple, low on memory requirements, and performance seems OK, so I'd prefer to just roll with it.
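For reference, the question being decided can be written as a bounded linear Diophantine problem (a sketch of the formulation only; the implementation additionally has to account for item sizes, i.e. byte ranges rather than single addresses): do there exist integers $x_i$, $y_j$ with

$$b_1 + \sum_{i=1}^{n} s_i x_i = b_2 + \sum_{j=1}^{m} t_j y_j, \qquad 0 \le x_i < d_i, \quad 0 \le y_j < e_j,$$

where $b_1, b_2$ are the base addresses, $s_i, t_j$ the strides (in bytes), and $d_i, e_j$ the shapes of the two arrays?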

@pv (Member, Author) commented Aug 5, 2015

Correction: overlap detection is used in PyArray_AssignArray --- this seems to affect assignment, but I don't understand the exact code paths ATM.

@charris (Member) commented Aug 5, 2015

This also seems to be related to reshaping arrays. Contiguous arrays, for instance, can be reshaped to 1-D. Arrays ones((5,5,5))[::2,5,5] and ones((5,5,5))[5,5,::2] can be reshaped to 2-D, etc. Reshaping to minimal dimensions looks to be a straightforward problem. Whether it is worth doing is another question.
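For example, a small sketch of that kind of case, under the assumption that reshape merges compatible contiguous axes without copying:

import numpy as np
a = np.ones((5, 5, 5))[::2]   # outer axis strided, inner two axes contiguous
b = a.reshape(3, 25)          # the inner axes can be merged, so this can be a view
b[0, 0] = 7
print(a[0, 0, 0])             # 7.0 -- the reshape did not copy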

@charris (Member) commented Aug 5, 2015

The failures mostly look build-related, although there might be some valid issues turned up in the -OO build.

@argriffing (Contributor) commented:

The Parameters section of the docstring says the max_work default is 0, but the Examples section gives different results for the default value vs. using 0 explicitly:

>>> np.may_share_memory(x[0,:], x[1,:])
True
>>> np.may_share_memory(x[0,:], x[1,:], max_work=0)
False

@pv (Member, Author) commented Aug 5, 2015 via email

An inline review comment (Member):

Trivial style thing -- let's instantiate an RNG object instead of using the global RNG?

@njsmith (Member) commented Aug 6, 2015

This is totally amazing, @pv!

I feel like there are two reasonable ways we could deploy this:

  • Option A: call may_share_memory with some reasonable work= bound; make a copy whenever they either share memory or we found we had too many candidates to check.
  • Option B: call definitely_share_memory [1] with some reasonable work= bound; issue a warning whenever the arrays definitely share memory, but never copy.

[1] yes, definitely_share_memory doesn't actually exist ATM, but it looks like it would be trivial to write based on this :-). Maybe we should even just have share_memory(a, b, work=...) which returns True or returns False or raises a TooHard exception.
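A minimal sketch of what option A might look like at an assignment site (hypothetical helper name, assuming the work-bounded may_share_memory from this PR answers True whenever it cannot rule out overlap within the budget):

import numpy as np

def guarded_assign(dst, src, max_work=1000):
    # Copy whenever overlap is possible or the solver gave up within max_work.
    if np.may_share_memory(dst, src, max_work=max_work):
        src = src.copy()
    dst[...] = src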

I feel like the decision comes down to: given the distributions of arrays people are using in practice, can we choose a work= bound that is low enough that people basically don't notice the slowdown, but high enough that we can give exact answers on basically all reasonable cases? If so then we should definitely go with option A; if not then maybe we have to fall back on option B.

Does anyone have any ideas for how to go about figuring out the answer to the above question? The analysis of special cases in the big comment in the source makes me hopeful, but not 100% confident on its own. And I'm nervous about advertising to users that they can start relying on things like a[1:] += a[:-1] without being very sure that we actually can support that.

I guess one option would be to make a release that just prints out a big message every time it encounters some arrays where it can't answer the question with work=10 or something, and then wait to see how many people report tripping the warning?

@njsmith (Member) commented Aug 6, 2015

I'd also like to see tests that explicitly check that various realistic cases actually are handled when the work bound is small. (Or maybe those are there already, but I didn't see any on a quick skim.)

One more thought: maybe the solution to the problem of wanting to support "power users" is to just add a ufunc kwarg check_overlap=True, which can be set to False as an "I know what I'm doing" relief valve. This would also give us a place to officially document that if you set this to false and your arrays do overlap, then we reserve the right to return totally arbitrary results, please stop making assumptions about iteration order.

@akhmerov commented Aug 6, 2015

> One more thought: maybe the solution to the problem of wanting to support "power users" is to just add a ufunc kwarg check_overlap=True, which can be set to False as an "I know what I'm doing" relief valve. This would also give us a place to officially document that if you set this to false and your arrays do overlap, then we reserve the right to return totally arbitrary results, please stop making assumptions about iteration order.

Should this even be possible? Numpy doesn't offer an easy way to turn off the bounds check, while that one is cheaper and safer (omitting a check leads to behavior that doesn't depend on implementation details).

@seberg (Member) commented Aug 6, 2015

I really like this. My guess is we should likely start off with a warning (or possibly allow np.seterr logic) and then increase to an error or a copy for obvious cases.

Two things that must be checked after things (may) share memory (thinking about ufuncs here):

  1. Identical arrays work predictably (trivial)
  2. 1-D arrays work predictably (unbuffered), but we should give a warning for them, I think. However, we probably need to hold back an error at least for the typical deprecation time. And then I would rather go via an error for a while instead of breaking working code by implementing copy logic.

We could silently copy in the longer run as a service for users who do not control the function input. I doubt a keyword is of much use. As far as I can think of right now, the 1-d case is probably the only one where you do not have to read the nditer docs/code to be sure for all cases.

@akhmerov commented Aug 6, 2015

@seberg what's wrong with always silently giving the correct result? It wouldn't break existing code that accidentally works.

@akhmerov commented Aug 6, 2015

About deployment: also numpy.dot(a, a.T, out=a) is affected, and should probably make a silent copy.
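For illustration (a sketch only; the point is precisely that the overlapping call's result is implementation-dependent):

import numpy as np
a = np.arange(4.0).reshape(2, 2)
# np.dot(a, a.T, out=a)   # out aliases both inputs; the result is not guaranteed
a[...] = a.dot(a.T)       # explicit temporary -- what a silent copy would do automatically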

@seberg (Member) commented Aug 6, 2015

@akhmerov, I don't mind silent copy in almost all cases, but as Nathaniel says, some users might like a warning about the silent copy (though I doubt this is important).

What I mostly wanted to warn about is the 1-d overlapping in-place ufunc, which gives reliable results that are not equivalent to a copy. I do believe that for all other cases the result is not very reliable, but that does not rule out that somebody relies on it anyway. I would not be surprised if someone makes use of it, so giving a warning first seems the safer bet. However, the only thing I feel strongly about is the 1-d case, which I have done myself, and I am sure many, many people did (I would be surprised if SciPy does not have examples of it), and converting working code into broken code is dangerous.

@njsmith (Member) commented Aug 6, 2015

@akhmerov:

> Should this even be possible? Numpy doesn't offer an easy way to turn off the bounds check, while that one is cheaper and safer (omitting a check leads to behavior that doesn't depend on implementation details).

Well, exactly :-). Bounds checks are quick and reliable, so we can just do them all the time and not worry; this is something that might take a lot of time or -- if we prevent that by enforcing a timeout -- might return unreliable results that trigger undesired behavior (like unnecessary copies or warnings).

@seberg:
Didn't we already break a bunch of these cases recently and just tell people to deal with it? Are the 1d cases actually reliable across the last, say, 5 years of numpy releases? Or am I mixing that up with the higher-dimensional cases?

I guess one migration strategy that would address your concern and also my concerns above, would be to initially, whenever we detect illegal overlap and think that copying would be a good idea, instead simply raise an error. This is reasonable, since all the illegal overlap cases are (by definition) ones where we have never guaranteed any particular behaviour, so it would both give people who are depending on this a good chance to catch their problems, and at the same time it would let us safely smoke out whether there are any problems with our overlap detection without committing to providing any particularly useful semantics. Then once we are confident that it works and we can support it, we could replace the errors by silent copies. (Or maybe even decide that explicit is better than implicit and leave the errors while telling people to make their own copies if that's what they want...)

And hmm, right, I hadn't thought about which cases (g)ufuncs should allow overlap, that's important to get down precisely :-). I guess for scalar (non-generalized) ufuncs, the rule is that out and in arrays are allowed to overlap iff they are identical? And for gufuncs, overlap is never legal (but as a special case optimization, if out and in are identical, then instead of copying the whole input at once you might instead buffer each input block as it is processed).
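A sketch of the "identical" test meant above (hypothetical helper; "identical" taken as the same memory block with the same layout, not merely the same Python object):

def is_identical_view(a, b):
    # Same data pointer and same element layout -- the only overlap between `out`
    # and an input that the rule above would allow for a scalar ufunc.
    return (a.__array_interface__['data'] == b.__array_interface__['data']
            and a.shape == b.shape
            and a.strides == b.strides
            and a.dtype == b.dtype)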

@akhmerov commented Aug 6, 2015

About the work limit: is there a case of two arrays that fit into RAM and that have the amount of work at all comparable to, e.g., the cost of applying a unary ufunc to the smaller of them? If I'm not mistaken, ufuncs are expspace-hard for large numbers of dimensions.

@pv (Member, Author) commented Aug 6, 2015

I note code relying on non-copy semantics is brittle also in 1-D cases --- we have previously changed the behavior in some cases:

import numpy as np
x = np.arange(5)[::-1]
x[1:] += x[:-1]
print(np.__version__, x)

produces:

('1.6.2', array([ 4, 7, 9, 10, 10]))
('1.9.2', array([4, 7, 5, 3, 1]))

The only case that works the same in 1.9.x and 1.5.x is probably 1-D with positive stride. It seems to me non-copy semantics also for this case is undefined behavior.

The output array from a binary ufunc has size >= max(a.size, b.size), and the ufunc fills all entries.

@seberg (Member) commented Aug 6, 2015

@pv, oh wow, how the heck did that change? It is not changed in 1.8. Frankly, if 1.9 were not so long ago, I would think this is a regression; I was quite certain that in the 1-d case the ufunc sets the "do not reverse strides" flag for nditer (I had to add that for advanced indexing).

My guess is that it should be well defined (whatever that means) for all arrays that are obviously best iterated in C-order; the simple "different offset" 1-d case is just the only easily reliable one I can think of.

Maybe I am just paranoid because I used the arr[1:] += arr[:-1] trick for a while (sure, it is easily replaced, but silently wrong results...).

@pv (Member, Author) commented Aug 6, 2015 via email

@seberg (Member) commented Aug 6, 2015

I don't mind the copy in the long run, but I am still a bit wary of changing working code. If scipy does not do this at all, I guess that is a good hint that we can get away with it though.

I have tried to bisect when this changed, but did not quite manage. It all points to somewhere in my indexing changes, but there is no weird indexing going on here! You get the same result with:

import numpy as np
x = np.arange(5)[::-1]
np.add(x[1:], x[:-1], out=x[1:])
print(np.__version__, x)

@seberg (Member) commented Aug 6, 2015

:( it is my fault that this changed. I changed the speed special case because I misunderstood the logic during the indexing stuff:

--- a/numpy/core/src/private/lowlevel_strided_loops.h
+++ b/numpy/core/src/private/lowlevel_strided_loops.h
@@ -668,7 +668,7 @@ npy_bswap8_unaligned(char * x)
                                              PyArray_DIMS(arr2), \
                                              PyArray_NDIM(arr1)) && \
                         (PyArray_FLAGS(arr1)&(NPY_ARRAY_C_CONTIGUOUS| \
-                                      NPY_ARRAY_F_CONTIGUOUS)) & \
+                                      NPY_ARRAY_F_CONTIGUOUS)) == \
                                 (PyArray_FLAGS(arr2)&(NPY_ARRAY_C_CONTIGUOUS| \
                                               NPY_ARRAY_F_CONTIGUOUS)) \
                         )

This does not 100% convince me that it is OK to screw up the 1-D in-place stuff, but the fact that nobody noticed might make it more reasonable. But the question is whether just having a warning for a bit isn't the safer bet.

Sorry for hijacking the thread a bit. It seems the fast special-case path is a bit weird anyway, since two 1-d arrays do not hit it if one is not contiguous and the other is, I think.

@pv (Member, Author) commented Aug 6, 2015

Here's a tool for checking at runtime whether some code assumes non-copy semantics (it detects issues, however, only if the data is such that the results of the two operations differ), or calls ufuncs with overlapping in/out arrays: https://gist.github.com/pv/d1ecc8784c9d6a2a92bf

The outcome is that no ufuncs are called on arrays whose memory bounds overlap (max_work=0) in the Scipy or Pandas test suites (although it did find a fixed bug, maybe you can guess what that was) --- the exceptions were trivial a += 3 type operations. sklearn does the x[:,0] += x[:,1] thing, but not more --- they also have a bug in their tests (guess what bug). Numpy's nditer tests have an in-place operation on a stride-0 array, and the Legendre polynomials have c[j - 1] += c[j], but there are no other cases of memory-bound overlap. Astropy does not have ufunc calls with overlap (although 315/7981 tests don't run because they use ndarray subclasses and other brittle things). The pytables and statsmodels test suites don't appear to make overlapping ufunc calls.
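The rough idea of such a runtime check for a single ufunc (a simplified sketch; not necessarily how the linked gist is implemented -- it assumes out is passed as a keyword and is a single array):

import numpy as np

_original_add = np.add

def checked_add(*args, **kwargs):
    out = kwargs.get('out')
    # max_work=0 is the cheap memory-bounds-only check.
    if out is not None and any(np.may_share_memory(out, a, max_work=0) for a in args):
        print("np.add called with overlapping in/out memory bounds")
    return _original_add(*args, **kwargs)

np.add = checked_add   # catches explicit np.add(...) calls; `+=` still goes to the C ufunc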

@argriffing (Contributor) commented:

I think this is really impressive.

@charris (Member) commented Aug 6, 2015

Hmm, c[j - 1] += c[j], wonder where the overlap comes from?

Anything that finds bugs is good ;)

@seberg (Member) commented Aug 7, 2015

OK, btw. I retract all my doubts about 1-d. I had not realized that the SIMD optimizations (I suppose) messed this kind of trick up long ago for the most obvious "use cases".

@pv (Member, Author) commented Aug 7, 2015 via email

@aarchiba (Contributor) commented Aug 7, 2015

@charris Years ago I rewrote reshape to use views in the vast majority of cases where it actually works. There are a few corner cases, but not many. So I don't think that this is needed, necessarily.

@seberg (Member) commented Aug 7, 2015

NDIter also has such reshape logic in place to optimize iteration (with axes reordering, unlike reshape); I am not sure whether it handles all the corner cases. It probably makes sense to add the "copy on overlap" logic to NDIter itself. On the other hand, the "fast paths" also need the check (but maybe they could just drop through to NDIter if this happens).
Never mind reordering though (since I guess we need to detect things early on anyway); a copy-on-overlap flag which uses a temporary output operand in NDIter seems to me like the right place to add the logic. I would have to dig into the code to figure out where exactly the logic would have to be put... Fast paths will be annoying :(, though we can maybe just add a cheap version to the fast-path macros.

@pv (Member, Author) commented Aug 10, 2015

The NDIter changes should be a separate PR, probably not so trivial...

This one may be getting close to conclusion of some sort:

  • Added a fuzz test demonstrating that "easy" overlap cases are solved with max_work=1.
  • Considering one solution candidate appears to cost something like ~1000x the memcpy of one array element. I added an automatic work bound based on this measure, but the number could probably be tuned by more testing.
  • The worst case for arrays with no internal overlaps seems to be bounded by max_work <= max(a.size, b.size), but I don't have a proof. Just considering the for loops gives a (probably loose) bound ~ a.size*b.size.
  • It is possible to construct "hard" stride-trick cases already for a.ndim=b.ndim=3, where the solution takes ~1 minute while the ufunc takes 2 seconds. Also for the one example here, max_work <= max(a.size, b.size).
  • No instances of non-"easy" array overlaps were found in the scipy, pandas, scikit-learn, statsmodels, astropy, or pytables test suites (apart from x += x.T bugs).

@pv (Member, Author) commented Aug 16, 2015

Added an internal-overlap solver. I'm not sure whether we will actually want to use it for something in the end, but it's there.

@seberg (Member) commented Aug 17, 2015

@pv, thinking about it, the ufunc machinery might be doing "illegal" things when operands are identical. That would explain why it seems to half work but garbage data gets written to the output. Your changes could likely fix that, so I will try it in the next few days.

@pv (Member, Author) commented Aug 17, 2015

@seberg: ufunc_object.c:iterator_loop has this (the highlighted second loop uses op rather than op_it):

for (i = 0; i < nin; ++i) {
    baseptrs[i] = PyArray_BYTES(op_it[i]);
}
for (i = nin; i < nop; ++i) {
    baseptrs[i] = PyArray_BYTES(op[i]);   /* <-- */
}

It ends up writing to the wrong array. Maybe it's just a typo, maybe there's a reason for this... Changing it doesn't seem to break any tests, and it makes things work for copying output. (Perhaps it's just an oversight, cf. 4faf10e which added copying of read-inputs.)

The decision of what input to copy should maybe be made based on the number of real elements in the operands: you copy the output if the sum of elements in the overlapping inputs is larger than the number of elements in the output. This should give reasonable performance when broadcasting.

The logic could also perhaps be made smarter and just use the iterator buffering instead if the arrays are small enough. I don't know the logic here well enough ATM, however.

@pv (Member, Author) commented Aug 17, 2015

Anyway, the fact that changing this sort of thing doesn't require more work probably says good things about the structuring of the iterator/ufunc codebase.

@seberg (Member) commented Aug 17, 2015

@pv yeah, thanks! I was thinking this morning that it was likely some ufunc stuff that made it break so badly. Yes, it seems pretty straightforward, though I am sure there are some details still to figure out. Anyway, I think we should put this in whenever it is ready; it should be used in various places soon enough (we could create an issue to list some of the places where it would be nice, i.e. the iterator, dot, indexing, and probably some other smaller things).

Actually, if the arrays are small, copying is faster than buffering I think, so if anything only the opposite could make sense, and that is not possible I think.

@pv (Member, Author) commented Aug 17, 2015

I think this PR is fairly feature-complete now. Better algorithms could be added later on, but they probably aren't really necessary given that the "easy" case covers the real-life use cases seen so far.

@pv force-pushed the sld branch 2 times, most recently from 66e5f16 to 6310a7f on August 17, 2015
@pv changed the title from "WIP: Better memory overlap detection" to "Better memory overlap detection" on Aug 20, 2015
@charris (Member) commented Aug 28, 2015

@pv Could you add a note to the 1.11.0 release notes? Under Improvements looks like a good place.

@pv (Member, Author) commented Aug 29, 2015

Rebased.

@charris added a commit referencing this pull request on Aug 29, 2015: "Better memory overlap detection"
@charris merged commit 7001d61 into numpy:master on Aug 29, 2015
@charris (Member) commented Aug 29, 2015

Let's give this a shot. Thanks Pauli.

@juliantaylor (Contributor) commented:

Sorry, I didn't follow all of this -- how relevant is the checked integer arithmetic for performance? If it is relevant, some of it could be optimized with compiler intrinsics like the already existing npy_mul_with_overflow.

@pv (Member, Author) commented Sep 11, 2015

I didn't try to benchmark it without overflow checking, so it's unclear. The typical runtime cost, however, is probably not big compared to the other stuff going on in ufunc dispatch preparation.

