ENH: Faster algorithm for computing histograms with equal-size bins (#6100)
Conversation
numpy/lib/function_base.py
For indices and counts we always use np.intp rather than int.
Ah ok - but did you mean to comment on the line with indices = tmp_a.astype(int)? (ntype is not used for indices - the code you commented on is as it was before, I just moved it)
Oh, I see! Then it is a bug in the current implementation. An unlikely bug, but the histogram of an array of 2**31 items on a 64-bit system, all of which end up in the same bin, will show a -1 count for that bin, which is not quite what we are after. Since you are at it, you may as well fix it, since it does affect your code: I'm pretty sure bincount always returns either np.intp (without weights) or np.double (with weights).
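The claim about bincount's return types is easy to check; a small sketch (my own illustration, not code from the PR):

```python
import numpy as np

# np.bincount returns platform-sized integer counts without weights,
# and double-precision sums when weights are supplied
counts = np.bincount(np.array([0, 1, 1, 2]))
weighted = np.bincount(np.array([0, 1, 1, 2]), weights=[1.0, 2.0, 3.0, 4.0])
print(counts.dtype, weighted.dtype)  # intp (typically int64), float64
```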
@jaimefrio - ah, I see. Actually, testing for this uncovered a bug in bincount:
In [1]: import numpy as np
In [2]: values = np.ones(2**32, dtype=np.intp)
In [3]: np.bincount(values)
/Volumes/Raptor/miniconda3/envs/production/bin/python.app: line 3: 7001 Segmentation fault: 11 /Volumes/Raptor/miniconda3/envs/production/python.app/Contents/MacOS/python "$@"
Do you see the same issue?
One generic comment: by using
@jaimefrio - thanks for the feedback! I agree about needing to be careful about the weights type. I'm not sure if the code was originally intended to be used with non-float weights since there was the comment
@jaimefrio - I've now made sure that if the weights are complex, we do the right thing with bincount, and if they are not complex, float, or int, we revert to the original algorithm (and I added a test for all these). Note that there is still an issue with
(force-pushed 1c73e0a to 6fc0889)
Quick question: in one of my use cases, instead of providing the number of bins, I provide an array of equally spaced bins. The obvious fix is on my end, but I believe you are, technically speaking, not using the optimized code for the case where an array of equally spaced bins is provided. Is that correct? Would there be problems associated with doing that check as well? I.e. something along the lines of:
bin_widths = np.diff(bins)
if not iterable(bins) or np.all(bin_widths == bin_widths[0])
There's some overhead for that check of course, and there are probably rounding issues, so the simplest way out may just be to use the function as intended.
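If one did want that check, a tolerance-based sketch might look like this (hypothetical: `bins_look_uniform` and the tolerance value are my own, not part of the PR):

```python
import numpy as np

def bins_look_uniform(bins, rtol=1e-10):
    # heuristic: treat the edges as equally spaced if all widths match
    # the first width up to a small relative tolerance
    widths = np.diff(bins)
    return bool(np.all(np.abs(widths - widths[0]) <= rtol * np.abs(widths[0])))

print(bins_look_uniform(np.linspace(0, 1, 11)))  # True despite rounding in the edges
print(bins_look_uniform([0.0, 0.1, 0.3, 0.6]))   # False
```

An exact `==` comparison would fail for `linspace`-style edges, which is the rounding problem raised below.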
@ctk3b, you might want the
The issue with checking the bin widths is that it can fail due to rounding error, since a set of bins that are equally spaced in floating point doesn't have to pass that check.
edit: doh, I see you are aware that rounding issues exist.
Yeah, that's the tricky part I guess. One thing to maybe consider, if the cost of the check is acceptable, is to do a
*but they have to be absolutely sure that's what they want
I personally don't think that we should check for equal bin spacing in the bins array (if it's not an int), but I'm happy to add it if there is a consensus. An alternative is to simply make the docstring clearer and add a note about how to get optimal performance.
@jaimefrio - apart from that question, I think this is pretty much ready for a final review.
numpy/lib/function_base.py
I think this would be better implemented as:
if weights is not None and not (np.can_cast(weights.dtype, np.double) or
np.can_cast(weights.dtype, np.complex)):
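For reference, np.can_cast behaves as one would hope for the dtypes in question (a small illustration, not PR code; note that np.complex was an alias for the builtin complex in NumPy of that era):

```python
import numpy as np

# dtypes that cast safely to double can go through the weighted bincount path
print(np.can_cast(np.int32, np.double))        # True: int32 fits in a double
print(np.can_cast(np.float64, np.complex128))  # True: real embeds in complex
print(np.can_cast(np.complex128, np.double))   # False: would drop the imaginary part
```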
It mostly LGTM. I have quite a few questions/comments, but aside from the int-that-should-be-an-intp, there is nothing fundamentally wrong that I see.
@jaimefrio - I'll reply here about the blocks/chunks to make sure the comment stays here in the long term. I copied the use of the blocks from the older/slower method. It turns out that doing this results in a factor of 2 improvement in performance. Let's take the case of a 10^8 array. The times to do the histogram (with the new method in this PR) are:
So with
While slightly inelegant, I do think a 2x improvement in speed and 3x improvement in memory usage is worth it. In fact, I think that a 3x worsening in memory usage would be considered a performance regression by many, even if the code ran 5x faster. Do you agree that it makes sense to leave it as-is? I can add a comment about why this is done.
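The chunked fast path under discussion can be sketched roughly like this (my reconstruction from the conversation, not the PR's actual code; the BLOCK value is the one mentioned in-thread, and everything else is simplified, e.g. weights are omitted):

```python
import numpy as np

BLOCK = 65536  # block size discussed in this thread

def blocked_bincount_histogram(a, bins, limits):
    """Sketch of the fast path: equal-size bins let us compute bin
    indices by scaling, and chunking bounds the temporary memory."""
    mn, mx = (float(x) for x in limits)
    norm = bins / (mx - mn)
    n = np.zeros(bins, dtype=np.intp)
    for i in range(0, len(a), BLOCK):
        tmp_a = np.asarray(a[i:i + BLOCK], dtype=float)
        keep = (tmp_a >= mn)
        keep &= (tmp_a <= mx)
        tmp_a = tmp_a[keep]
        indices = ((tmp_a - mn) * norm).astype(np.intp)
        indices[indices == bins] -= 1  # values equal to mx belong in the last bin
        n += np.bincount(indices, minlength=bins).astype(np.intp)
    return n
```

With 10 equal bins over [0, 10] and values placed at bin centres, each bin counts its share exactly; the temporaries (`tmp_a`, `keep`, `indices`) never exceed BLOCK elements, which is the memory win discussed above.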
I'll buy that 2x. Thanks for taking the time to substantiate all my existential doubts with measurements.
Correct me if I am wrong, but it seems we both agree that, eventually, the right thing to do would be to rewrite at least this particular path in C, right?
If you could fix that int-that-should-be-an-intp, and get rid of some of the extraneous white space, I'll go ahead and merge it.
@jaimefrio - yes, I think in the long term we could benefit from switching over to C and I'd be happy to look into it (but that will need to wait a few weeks). I'll finish up the current pull request now so that you can go ahead and merge it. |
(force-pushed 5ab0ad5 to 4325078)
@jaimefrio - I've made the fixes you requested. I'd be interested in trying to tackle the switch to C - should I open an issue to keep track of it, or should I just open a pull request once I have something working?
@jaimefrio - I can also look into applying the same kind of optimization to the
This seems poorly chosen: this is one L1 cache's worth, and numpy likes its copies, so you always end up wiping it.
I get a 30% performance increase for N = 20000 when the block is 4 times smaller, though the same can be achieved by doing the keep computation in place.
Ah, it's not the cache (it is very hard to make use of L1 with numpy, unfortunately); it removes 200,000 page faults. 64k seems to trigger some different code path in glibc - not sure why yet, glibc usually only switches allocation mode at much larger values.
Just to clarify, I just used the value that was set for the original slow method, but as I indicated in #6100 (comment) for large arrays we'd actually benefit from a slightly larger block size.
Rather than spend too much time trying to find a one-size-fits-all block size, I would suggest leaving the block size for now since this is code that I plan to try and replace with a C iteration (as @jaimefrio and I were discussing).
However, I agree that the keep operation could be done in-place instead. I'll look at this now.
@juliantaylor - actually, could you clarify what you mean by doing the keep computation in place? I tried doing:
keep = tmp_a >= mn
np.logical_and(keep, tmp_a <= mx, out=keep)
but I don't see any improvements. Did you mean something else?
I also just checked with BLOCK=65536/4 for different array sizes from 1 to 1e8, but couldn't see any for which this was faster than the current block size.
this is more readable:
keep = (tmp_a >= mn)
keep &= (tmp_a <= mx)
I don't really understand what is going on, though my simple test case seems to trigger some bad memory allocation pattern. If I add another histogram call of another size, it disappears.
It still might be a good idea, as there is at least one case where it's significantly better and it just adds one line.
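To make the comparison concrete, here is a tiny sketch (not PR code) showing that the two spellings produce the same mask; the `&=` form just reuses the first comparison's array instead of allocating a third one:

```python
import numpy as np

tmp_a = np.array([0.5, 1.5, 2.5, 3.5])
mn, mx = 1.0, 3.0

# one-shot form: allocates a fresh array for the combined mask
keep_oneshot = (tmp_a >= mn) & (tmp_a <= mx)

# in-place form: the second comparison is ANDed into the existing array
keep = (tmp_a >= mn)
keep &= (tmp_a <= mx)

print(keep)  # [False  True  True False]
```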
Ok, done in 46b28f30539bacc19e803f4392fab95fc3c127db!
Six commits seem a few too many: if you can squash them into a single one, we can put it in.
(force-pushed 46b28f3 to 34b582a)
@jaimefrio - done!
I think the correct iteration limit is something like range(0, ((len(a) - 1) // BLOCK + 1) * BLOCK, BLOCK). With the current limit, an array whose size is a multiple of BLOCK will run an extra iteration on an empty array.
Also, I think using Python's range here is better than NumPy's arange, as it will be an iterator in Py3K, and in Python 2 building a list is typically faster than an array.
Should I change this for the 'original' method too? (I copied it from there)
@jaimefrio - are you sure about the extra iteration? If len(a) is a multiple of BLOCKSIZE, then because range is exclusive at the upper end, I think it does the right thing:
In [3]: np.arange(0, 30, 10)
Out[3]: array([ 0, 10, 20])
(but yes, we could simply change to Python's range)
Yes, and a[20:20+10] is the full last block.
edit: maybe the confusing thing is that Python slicing does not care about out-of-bounds indices.
@seberg - so you agree that the only thing that needs changing is arange -> range, but otherwise it's ok? (want to make sure I wait until there is a consensus otherwise if I amend the existing commit and force push it might get rid of this thread)
Sorry, would look like it to me, it was a bit of a stupid comment, misread the rest.
@jaimefrio - do you agree? If so, I will make the arange -> range change.
And that's why you don't review code after midnight... You, and whoever wrote the first version, are absolutely right, I was absolutely wrong. I won't make you push yet another change for the range/arange thing.
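For the record, the resolved point can be verified with toy sizes (the BLOCK value here is illustrative, not the PR's):

```python
# when len(a) is an exact multiple of BLOCK, range(0, len(a), BLOCK)
# yields len(a)//BLOCK start offsets and no empty trailing block
BLOCK = 10
a = list(range(30))
starts = list(range(0, len(a), BLOCK))
blocks = [a[i:i + BLOCK] for i in starts]
print(starts, [len(b) for b in blocks])  # [0, 10, 20] [10, 10, 10]
```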
ENH: Faster algorithm for computing histograms with equal-size bins
Thanks a lot, Thomas.
@jaimefrio - thanks for all your comments! Once I get a chance (reasonably soon, hopefully) I'll investigate some of the further improvements discussed here, including for example switching to C for part of the code.
mn -= 0.5
mx += 0.5
# At this point, if the weights are not integer, floating point, or
# complex, we have to use the slow algorithm.
What is the rationale for this? What type of weight would not work here?
I can't remember for sure, but I think e.g. object arrays work in the weights array:
In [7]: np.histogram([1,2,3], weights=np.array([Decimal(1), Decimal(2.2), Decimal(4.4)], dtype=np.object))
Out[7]:
(array([Decimal('1'), Decimal('0'), Decimal('0'), Decimal('0'),
Decimal('0'), Decimal('2.200000000000000177635683940'),
Decimal('0E-27'), Decimal('0E-27'), Decimal('0E-27'),
Decimal('4.400000000000000355271367880')], dtype=object),
array([ 1. , 1.2, 1.4, 1.6, 1.8, 2. , 2.2, 2.4, 2.6, 2.8, 3. ]))
Yep, that's it, thanks - you're guarding bincount, which only works for the types you list there.
About
When bins are equal-sized, we don't have to use searchsorted to find which bins values fall in; we can instead just scale and convert the values to bin indices and use bincount.
Results
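The core idea can be sketched in a few lines (illustration only; the real implementation also handles weights, chunking, and values exactly on the last edge):

```python
import numpy as np

# for equal-size bins the bin index is a linear function of the value,
# so a scale-and-truncate replaces the searchsorted lookup
a = np.array([0.1, 0.4, 0.4, 0.8])
bins, mn, mx = 4, 0.0, 1.0
indices = ((a - mn) * (bins / (mx - mn))).astype(np.intp)
counts = np.bincount(indices, minlength=bins)
print(counts)  # [1 2 0 1]
```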
I carried out speed tests with arrays going from 1 to 10^8 elements. The values were randomly sampled with:
and the histogram was computed with:
The speed difference is as follows:
The new method is always faster, and for more than 10^5 elements, the speedup is a factor of ~10x.
I then checked that the memory usage is not increased when binning an array with 10^9 values:
There is no memory penalty!
Note that I originally implemented this without the iteration over blocks, and not only did it use more memory, it was 2x as slow as the version with looping over chunks (to my surprise).
Future work
If this PR is merged, we can also consider switching to using C or Cython for the part where the bin indices are computed. This could provide a further ~3x improvement in performance.
Notes
This partially addresses #6099 (np.histogram could be made even faster by switching to C in places)
A similar speedup could be made for histogramdd.
Of course, given how extensively this function is used, it would be great if people could subject it to more rigorous testing (I will also do some more testing to see if I can find cases where there is a performance regression).