Conversation

@pv (Member) commented Aug 5, 2015

Here's an initial stab at (i) exact solution for whether two arrays overlap in memory, and (ii) using it in array assignment overlap detection.

As the problem is presumably NP-hard, the algorithm used for (i) scales exponentially with the number of dimensions of the arrays. By limiting the depth of the search, one obtains an improved may_share_memory that can deal with several common cases for strided arrays. I use the limited version with a fixed search size to get a (fairly OK) approximate answer.
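For concreteness, here is a naive reference check of what "overlap" means here, enumerating the element start addresses of two same-dtype views (illustration only -- it is exponential in the number of elements and is not the algorithm used in this PR, which works on the base pointers, strides and shapes directly):

import itertools
import numpy as np

def overlap_brute_force(a, b):
    # Set of byte addresses at which elements of `arr` start.
    def addresses(arr):
        base = arr.__array_interface__['data'][0]
        return {base + sum(i * s for i, s in zip(idx, arr.strides))
                for idx in itertools.product(*map(range, arr.shape))}
    return bool(addresses(a) & addresses(b))

x = np.arange(10)
print(overlap_brute_force(x[::2], x[1::2]))   # False: interleaved views share no element
print(overlap_brute_force(x[:6], x[4:]))      # True: elements 4 and 5 are shared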

Currently, overlap detection in Numpy appears to be used for making temporary copies for array assignment.

First step at addressing gh-6119, gh-1683 etc.

The next step would be to reuse the logic by which we make copies for assignment to also make temporary copies of ufunc input arguments (and maybe use UPDATEIFCOPY for outputs).

This still needs somewhat more tests, and max_work should be tuned appropriately based on benchmarks. Perhaps it needs to depend on the size of the arrays, to balance the cost of making copies against the cost of solving the integer program.

I'm aware there are several other algorithms for solving Diophantine equations in the literature, as the problem apparently also crops up in automatic theorem proving. However, the present one is fairly simple, low on memory requirements, and performance seems OK, so I'd prefer to just roll with it.
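For reference, the question being decided can be written as a bounded linear Diophantine problem (a sketch of the formulation only; the implementation additionally has to account for item sizes, i.e. byte ranges rather than single addresses): do there exist integers $x_i$, $y_j$ with

$$b_1 + \sum_{i=1}^{n} s_i x_i = b_2 + \sum_{j=1}^{m} t_j y_j, \qquad 0 \le x_i < d_i, \quad 0 \le y_j < e_j,$$

where $b_1, b_2$ are the base addresses, $s_i, t_j$ the strides (in bytes), and $d_i, e_j$ the shapes of the two arrays?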

@pv (Member, Author) commented Aug 5, 2015

Correction: overlap detection is used in PyArray_AssignArray --- this seems to affect assignment, but I don't understand the exact code paths ATM.

@charris (Member) commented Aug 5, 2015

This also seems to be related to reshaping arrays. Contiguous arrays, for instance, can be reshaped to 1-D. Arrays ones((5,5,5))[::2,5,5] and ones((5,5,5))[5,5,::2] can be reshaped to 2-D, etc. Reshaping to minimal dimensions looks to be a straightforward problem. Whether it is worth doing is another question.
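For example, a small sketch of that kind of case, under the assumption that reshape merges compatible contiguous axes without copying:

import numpy as np
a = np.ones((5, 5, 5))[::2]   # outer axis strided, inner two axes contiguous
b = a.reshape(3, 25)          # the inner axes can be merged, so this can be a view
b[0, 0] = 7
print(a[0, 0, 0])             # 7.0 -- the reshape did not copy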

@charris (Member) commented Aug 5, 2015

The failures mostly look build-related, although there might be some valid issues turned up in the -OO build.

@argriffing (Contributor) commented:

The Parameters section of the docstring says the max_work default is 0, but the Examples section gives different results for the default value vs. using 0 explicitly:

>>> np.may_share_memory(x[0,:], x[1,:])
True
>>> np.may_share_memory(x[0,:], x[1,:], max_work=0)
False

@pv (Member, Author) commented Aug 5, 2015 via email

An inline review comment (Member):

Trivial style thing -- let's instantiate an RNG object instead of using the global RNG?

@njsmith (Member) commented Aug 6, 2015

This is totally amazing, @pv!

I feel like there are two reasonable ways we could deploy this:

  • Option A: call may_share_memory with some reasonable work= bound; make a copy whenever they either share memory or we found we had too many candidates to check.
  • Option B: call definitely_share_memory [1] with some reasonable work= bound; issue a warning whenever the arrays definitely share memory, but never copy.

[1] yes, definitely_share_memory doesn't actually exist ATM, but it looks like it would be trivial to write based on this :-). Maybe we should even just have share_memory(a, b, work=...) which returns True or returns False or raises a TooHard exception.
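A minimal sketch of what option A might look like at an assignment site (hypothetical helper name, assuming the work-bounded may_share_memory from this PR answers True whenever it cannot rule out overlap within the budget):

import numpy as np

def guarded_assign(dst, src, max_work=1000):
    # Copy whenever overlap is possible or the solver gave up within max_work.
    if np.may_share_memory(dst, src, max_work=max_work):
        src = src.copy()
    dst[...] = src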

I feel like the decision comes down to: given the distributions of arrays people are using in practice, can we choose a work= bound that is low enough that people basically don't notice the slowdown, but high enough that we can give exact answers on basically all reasonable cases? If so then we should definitely go with option A; if not then maybe we have to fall back on option B.

Does anyone have any ideas for how to go about figuring out the answer to the above question? The analysis of special cases in the big comment in the source makes me hopeful, but not 100% confident on its own. And I'm nervous about advertising to users that they can start relying on things like a[1:] += a[:-1] without being very sure that we actually can support that.

I guess one option would be to make a release that just prints out a big message every time it encounters some arrays where it can't answer the question with work=10 or something, and then wait to see how many people report tripping the warning?

@njsmith (Member) commented Aug 6, 2015

I'd also like to see tests that explicitly check that various realistic cases actually are handled when the work bound is small. (Or maybe those are there already, but I didn't see any on a quick skim.)

One more thought: maybe the solution to the problem of wanting to support "power users" is to just add a ufunc kwarg check_overlap=True, which can be set to False as an "I know what I'm doing" relief valve. This would also give us a place to officially document that if you set this to false and your arrays do overlap, then we reserve the right to return totally arbitrary results, please stop making assumptions about iteration order.

@akhmerov commented Aug 6, 2015

> One more thought: maybe the solution to the problem of wanting to support "power users" is to just add a ufunc kwarg check_overlap=True, which can be set to False as an "I know what I'm doing" relief valve. This would also give us a place to officially document that if you set this to false and your arrays do overlap, then we reserve the right to return totally arbitrary results, please stop making assumptions about iteration order.

Should this even be possible? Numpy doesn't offer an easy way to turn off the bounds check, while that one is cheaper and safer (omitting a check leads to behavior that doesn't depend on implementation details).

@seberg (Member) commented Aug 6, 2015

I really like this. My guess is we should likely start off with a warning (or possibly allow np.seterr logic) and then increase to an error or a copy for obvious cases.

Two things that must be checked after things (may) share memory (thinking about ufuncs here):

  1. Identical arrays work predictably (trivial)
  2. 1-D arrays work predictably (unbuffered), but we should give a warning for them, I think. However, we probably need to hold back an error at least for the typical deprecation time. And then I would rather go via an error for a while instead of breaking working code by implementing copy logic.

We could silently copy in the longer run as a service for users who do not control the function input. I doubt a keyword is of much use. As far as I can think of right now, the 1-d case is probably the only one where you do not have to read the nditer docs/code to be sure for all cases.

@akhmerov commented Aug 6, 2015

@seberg what's wrong with always silently giving the correct result? It wouldn't break existing code that accidentally works.

@akhmerov commented Aug 6, 2015

About deployment: also numpy.dot(a, a.T, out=a) is affected, and should probably make a silent copy.
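For illustration (a sketch only; the point is precisely that the overlapping call's result is implementation-dependent):

import numpy as np
a = np.arange(4.0).reshape(2, 2)
# np.dot(a, a.T, out=a)   # out aliases both inputs; the result is not guaranteed
a[...] = a.dot(a.T)       # explicit temporary -- what a silent copy would do automatically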

@seberg (Member) commented Aug 6, 2015

@akhmerov, I don't mind silent copy in almost all cases, but as Nathaniel says, some users might like a warning about the silent copy (though I doubt this is important).

What I mostly wanted to warn about is the 1-d overlapping in-place ufunc, which gives reliable results that are not equivalent to a copy. I do believe that for all other cases the result is not very reliable, but that does not rule out that somebody relies on it anyway. I would not be surprised if someone makes use of it, so giving a warning first seems the safer bet. However, the only thing I feel strongly about is the 1-d case, which I have done myself, and I am sure many, many people did (I would be surprised if SciPy does not have examples of it), and converting working code into broken code is dangerous.

@njsmith (Member) commented Aug 6, 2015

@akhmerov:

> Should this even be possible? Numpy doesn't offer an easy way to turn off the bounds check, while that one is cheaper and safer (omitting a check leads to behavior that doesn't depend on implementation details).

Well, exactly :-). Bounds checks are quick and reliable, so we can just do them all the time and not worry; this is something that might take a lot of time or -- if we prevent that by enforcing a timeout -- might return unreliable results that trigger undesired behavior (like unnecessary copies or warnings).

@seberg:
Didn't we already break a bunch of these cases recently and just tell people to deal with it? Are the 1d cases actually reliable across the last, say, 5 years of numpy releases? Or am I mixing that up with the higher-dimensional cases?

I guess one migration strategy that would address your concern and also my concerns above, would be to initially, whenever we detect illegal overlap and think that copying would be a good idea, instead simply raise an error. This is reasonable, since all the illegal overlap cases are (by definition) ones where we have never guaranteed any particular behaviour, so it would both give people who are depending on this a good chance to catch their problems, and at the same time it would let us safely smoke out whether there are any problems with our overlap detection without committing to providing any particularly useful semantics. Then once we are confident that it works and we can support it, we could replace the errors by silent copies. (Or maybe even decide that explicit is better than implicit and leave the errors while telling people to make their own copies if that's what they want...)

And hmm, right, I hadn't thought about which cases (g)ufuncs should allow overlap, that's important to get down precisely :-). I guess for scalar (non-generalized) ufuncs, the rule is that out and in arrays are allowed to overlap iff they are identical? And for gufuncs, overlap is never legal (but as a special case optimization, if out and in are identical, then instead of copying the whole input at once you might instead buffer each input block as it is processed).
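A sketch of the "identical" test meant above (hypothetical helper; "identical" taken as the same memory block with the same layout, not merely the same Python object):

def is_identical_view(a, b):
    # Same data pointer and same element layout -- the only overlap between `out`
    # and an input that the rule above would allow for a scalar ufunc.
    return (a.__array_interface__['data'] == b.__array_interface__['data']
            and a.shape == b.shape
            and a.strides == b.strides
            and a.dtype == b.dtype)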

@akhmerov commented Aug 6, 2015

About the work limit: is there a case of two arrays that fit into RAM and that have the amount of work at all comparable to, e.g., the cost of applying a unary ufunc to the smaller of them? If I'm not mistaken, ufuncs are expspace-hard for large numbers of dimensions.

@pv (Member, Author) commented Aug 6, 2015

I note code relying on non-copy semantics is brittle also in 1-D cases --- we have previously changed the behavior in some cases:

import numpy as np
x = np.arange(5)[::-1]
x[1:] += x[:-1]
print(np.__version__, x)

produces:

('1.6.2', array([ 4, 7, 9, 10, 10]))
('1.9.2', array([4, 7, 5, 3, 1]))

The only case that works the same in 1.9.x and 1.5.x is probably 1-D with positive stride. It seems to me non-copy semantics also for this case is undefined behavior.

The output array from a binary ufunc has size >= max(a.size, b.size), and the ufunc fills all entries.

@seberg (Member) commented Aug 6, 2015

@pv, oh wow, how the heck did that change? It is not changed in 1.8. Frankly, if 1.9 were not so long ago, I would think this is a regression; I was quite certain that in the 1-d case the ufunc sets the "do not reverse strides" flag for nditer (I had to add that for advanced indexing).

My guess is that it should be well defined (whatever that means) for all arrays that are obviously best iterated in C-order; the simple "different offset" 1-d case is just the only easily reliable one I can think of.

Maybe I am just paranoid because I used the arr[1:] += arr[:-1] trick for a while (sure, it is easily replaced, but silently wrong results...).

@pv (Member, Author) commented Aug 6, 2015 via email

@seberg (Member) commented Aug 6, 2015

I don't mind the copy in the long run, but I am still a bit wary of changing working code. If scipy does not do this at all, I guess that is a good hint that we can get away with it though.

I have tried to bisect when this changed, but did not quite manage. It all points to somewhere in my indexing changes, but there is no weird indexing going on here! You get the same result with:

import numpy as np
x = np.arange(5)[::-1]
np.add(x[1:], x[:-1], out=x[1:])
print(np.__version__, x)

@seberg (Member) commented Aug 6, 2015

:( it is my fault that this changed. I changed the speed special case because I misunderstood the logic during the indexing stuff:

--- a/numpy/core/src/private/lowlevel_strided_loops.h
+++ b/numpy/core/src/private/lowlevel_strided_loops.h
@@ -668,7 +668,7 @@ npy_bswap8_unaligned(char * x)
                                              PyArray_DIMS(arr2), \
                                              PyArray_NDIM(arr1)) && \
                         (PyArray_FLAGS(arr1)&(NPY_ARRAY_C_CONTIGUOUS| \
-                                      NPY_ARRAY_F_CONTIGUOUS)) & \
+                                      NPY_ARRAY_F_CONTIGUOUS)) == \
                                 (PyArray_FLAGS(arr2)&(NPY_ARRAY_C_CONTIGUOUS| \
                                               NPY_ARRAY_F_CONTIGUOUS)) \
                         )

This does not 100% convince me that it is OK to screw up the 1-D in-place stuff, but the fact that nobody noticed might make it more reasonable. But the question is whether just having a warning for a bit isn't the safer bet.

Sorry for hijacking the thread a bit. It seems the fast special-case path is a bit weird anyway, since two 1-d arrays do not hit it if one is not contiguous and the other is, I think.

@pv (Member, Author) commented Aug 6, 2015

Here's a tool for checking at runtime whether some code assumes non-copy semantics (it detects issues, however, only if the data is such that the results of the two operations differ), or calls ufuncs with overlapping in/out arrays: https://gist.github.com/pv/d1ecc8784c9d6a2a92bf

The outcome is that no ufuncs are called on arrays whose memory bounds overlap (max_work=0) in the Scipy or Pandas test suites (although it did find a fixed bug, maybe you can guess what that was) --- the exceptions were trivial a += 3 type operations. sklearn does the x[:,0] += x[:,1] thing, but not more --- they also have a bug in their tests (guess what bug). Numpy's nditer tests have an in-place operation on a stride-0 array, and the Legendre polynomials have c[j - 1] += c[j], but there are no other cases of memory-bound overlap. Astropy does not have ufunc calls with overlap (although 315/7981 tests don't run because they use ndarray subclasses and other brittle things). The pytables and statsmodels test suites don't appear to make overlapping ufunc calls.
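The rough idea of such a runtime check for a single ufunc (a simplified sketch; not necessarily how the linked gist is implemented -- it assumes out is passed as a keyword and is a single array):

import numpy as np

_original_add = np.add

def checked_add(*args, **kwargs):
    out = kwargs.get('out')
    # max_work=0 is the cheap memory-bounds-only check.
    if out is not None and any(np.may_share_memory(out, a, max_work=0) for a in args):
        print("np.add called with overlapping in/out memory bounds")
    return _original_add(*args, **kwargs)

np.add = checked_add   # catches explicit np.add(...) calls; `+=` still goes to the C ufunc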

@argriffing (Contributor) commented:

I think this is really impressive.

@charris (Member) commented Aug 6, 2015

Hmm, c[j - 1] += c[j], wonder where the overlap comes from?

Anything that finds bugs is good ;)

@seberg (Member) commented Aug 7, 2015

OK, btw. I retract all my doubts about 1-d. I had not realized that the SIMD optimizations (I suppose) messed this kind of trick up long ago for the most obvious "use cases".

@pv (Member, Author) commented Aug 7, 2015 via email

@aarchiba (Contributor) commented Aug 7, 2015

@charris Years ago I rewrote reshape to use views in the vast majority of cases where it actually works. There are a few corner cases, but not many. So I don't think that this is needed, necessarily.

@seberg (Member) commented Aug 7, 2015

NDIter also has such reshape logic in place to optimize iteration (with axes reordering, unlike reshape); I am not sure whether it handles all the corner cases. It probably makes sense to add the "copy on overlap" logic to NDIter itself. On the other hand, the "fast paths" also need the check (but maybe they could just drop through to NDIter if this happens).
Never mind reordering though (since I guess we need to detect things early on anyway); a copy-on-overlap flag which uses a temporary output operand in NDIter seems to me like the right place to add the logic. I would have to dig into the code to figure out where exactly the logic would have to be put... Fast paths will be annoying :(, though we can maybe just add a cheap version to the fast-path macros.

@pv (Member, Author) commented Aug 10, 2015

The NDIter changes should be a separate PR, probably not so trivial...

This one may be getting close to conclusion of some sort:

  • Added a fuzz test demonstrating that "easy" overlap cases are solved with max_work=1.
  • Considering one solution candidate appears to cost something like ~1000x the memcpy of one array element. I added an automatic work bound based on this measure, but the number could probably be tuned by more testing.
  • The worst case for arrays with no internal overlaps seems to be bounded by max_work <= max(a.size, b.size), but I don't have a proof. Just considering the for loops gives a (probably loose) bound ~ a.size*b.size.
  • It is possible to construct "hard" stride-trick cases already for a.ndim=b.ndim=3, where the solution takes ~1 minute while the ufunc takes 2 seconds. Also for the one example here, max_work <= max(a.size, b.size).
  • No instances of non-"easy" array overlaps were found in the scipy, pandas, scikit-learn, statsmodels, astropy, or pytables test suites (apart from x += x.T bugs).

@pv (Member, Author) commented Aug 16, 2015

Added an internal-overlap solver. I'm not sure whether we will actually want to use it for something in the end, but it's there.

@seberg (Member) commented Aug 17, 2015

@pv, thinking about it, the ufunc machinery might be doing "illegal" things when operands are identical. That would explain why it seems to half work but garbage data gets written to the output. Your changes could likely fix that, so I will try it in the next few days.

@pv (Member, Author) commented Aug 17, 2015

@seberg: ufunc_object.c:iterator_loop has this (the highlighted second loop uses op rather than op_it):

for (i = 0; i < nin; ++i) {
    baseptrs[i] = PyArray_BYTES(op_it[i]);
}
for (i = nin; i < nop; ++i) {
    baseptrs[i] = PyArray_BYTES(op[i]);   /* <-- */
}

It ends up writing to the wrong array. Maybe it's just a typo, maybe there's a reason for this... Changing it doesn't seem to break any tests, and it makes things work for copying output. (Perhaps it's just an oversight, cf. 4faf10e which added copying of read-inputs.)

The decision of what input to copy should maybe be made based on the number of real elements in the operands: you copy the output if the sum of elements in the overlapping inputs is larger than the number of elements in the output. This should give reasonable performance when broadcasting.

The logic could also perhaps be made smarter and just use the iterator buffering instead if the arrays are small enough. I don't know the logic here well enough ATM, however.

@pv (Member, Author) commented Aug 17, 2015

Anyway, the fact that changing this sort of thing doesn't require more work probably says good things about the structuring of the iterator/ufunc codebase.

@seberg (Member) commented Aug 17, 2015

@pv yeah, thanks! I was thinking this morning that it was likely some ufunc stuff that made it break so badly. Yes, it seems pretty straightforward, though I am sure there are some details still to figure out. Anyway, I think we should put this in whenever it is ready; it should be used in various places soon enough (we could create an issue to list some of the places where it would be nice, i.e. the iterator, dot, indexing, and probably some other smaller things).

Actually, if the arrays are small, copying is faster than buffering I think, so if anything only the opposite could make sense, and that is not possible I think.

@pv (Member, Author) commented Aug 17, 2015

I think this PR is fairly feature-complete now. Better algorithms could be added later on, but they probably aren't really necessary given that the "easy" case covers the real-life use cases seen so far.

@pv force-pushed the sld branch 2 times, most recently from 66e5f16 to 6310a7f on August 17, 2015
@pv changed the title from "WIP: Better memory overlap detection" to "Better memory overlap detection" on Aug 20, 2015
@charris (Member) commented Aug 28, 2015

@pv Could you add a note to the 1.11.0 release notes? Under Improvements looks like a good place.

@pv (Member, Author) commented Aug 29, 2015

Rebased.

@charris added a commit referencing this pull request on Aug 29, 2015: "Better memory overlap detection"
@charris merged commit 7001d61 into numpy:master on Aug 29, 2015
@charris (Member) commented Aug 29, 2015

Let's give this a shot. Thanks Pauli.

@juliantaylor (Contributor) commented:

Sorry, I didn't follow all of this -- how relevant is the checked integer arithmetic for performance? If it is relevant, some of it could be optimized with compiler intrinsics like the already existing npy_mul_with_overflow.

@pv (Member, Author) commented Sep 11, 2015

I didn't try to benchmark it without overflow checking, so it's unclear. The typical runtime cost, however, is probably not big compared to the other stuff going on in ufunc dispatch preparation.

