ENH: Faster algorithm for computing histograms with equal-size bins (#6100)
Conversation
numpy/lib/function_base.py
For indices and counts we always use np.intp rather than int.
Ah ok - but did you mean to comment on the line with indices = tmp_a.astype(int)? (ntype is not used for indices - the code you commented on is as it was before, I just moved it)
Oh, I see! Then it is a bug in the current implementation. An unlikely bug, but the histogram of an array of 2**31 items on a 64-bit system, all of which end up in the same bin, will show a -1 count for that bin, which is not quite what we are after. Since you are at it, you may as well fix it, since it does affect your code: I'm pretty sure bincount always returns either np.intp (without weights) or np.double (with weights).
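The claim about bincount's return types is easy to check; a small sketch (my own illustration, not code from the PR):

```python
import numpy as np

# np.bincount returns platform-sized integer counts without weights,
# and double-precision sums when weights are supplied
counts = np.bincount(np.array([0, 1, 1, 2]))
weighted = np.bincount(np.array([0, 1, 1, 2]), weights=[1.0, 2.0, 3.0, 4.0])
print(counts.dtype, weighted.dtype)  # intp (typically int64), float64
```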
@jaimefrio - ah, I see. Actually, testing for this uncovered a bug in bincount:
In [1]: import numpy as np
In [2]: values = np.ones(2**32, dtype=np.intp)
In [3]: np.bincount(values)
/Volumes/Raptor/miniconda3/envs/production/bin/python.app: line 3: 7001 Segmentation fault: 11 /Volumes/Raptor/miniconda3/envs/production/python.app/Contents/MacOS/python "$@"
Do you see the same issue?
One generic comment: by using
@jaimefrio - thanks for the feedback! I agree about needing to be careful about the weights type. I'm not sure if the code was originally intended to be used with non-float weights since there was the comment
@jaimefrio - I've now made sure that if the weights are complex, we do the right thing with bincount, and if they are not complex, float, or int, we revert to the original algorithm (and I added a test for all these). Note that there is still an issue with
(force-pushed 1c73e0a to 6fc0889)
Quick question: in one of my use cases, instead of providing the number of bins, I provide an array of equally spaced bins. The obvious fix is on my end, but I believe you are, technically speaking, not using the optimized code for the case where an array of equally spaced bins is provided. Is that correct? Would there be problems associated with doing that check as well? I.e. something along the lines of:
bin_widths = np.diff(bins)
if not iterable(bins) or np.all(bin_widths == bin_widths[0])
There's some overhead for that check of course, and there are probably rounding issues, so the simplest way out may just be to use the function as intended.
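If one did want that check, a tolerance-based sketch might look like this (hypothetical: `bins_look_uniform` and the tolerance value are my own, not part of the PR):

```python
import numpy as np

def bins_look_uniform(bins, rtol=1e-10):
    # heuristic: treat the edges as equally spaced if all widths match
    # the first width up to a small relative tolerance
    widths = np.diff(bins)
    return bool(np.all(np.abs(widths - widths[0]) <= rtol * np.abs(widths[0])))

print(bins_look_uniform(np.linspace(0, 1, 11)))  # True despite rounding in the edges
print(bins_look_uniform([0.0, 0.1, 0.3, 0.6]))   # False
```

An exact `==` comparison would fail for `linspace`-style edges, which is the rounding problem raised below.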
@ctk3b, you might want the
The issue with checking the bin widths is that it can fail due to rounding error, since a set of bins that are equally spaced in floating point doesn't have to pass that check.
edit: doh, I see you are aware that rounding issues exist.
Yeah, that's the tricky part I guess. One thing to maybe consider, if the cost of the check is acceptable, is to do a
*but they have to be absolutely sure that's what they want
I personally don't think that we should check for equal bin spacing in the bins array (if it's not an int), but I'm happy to add it if there is a consensus. An alternative is to simply make the docstring clearer and add a note about how to get optimal performance.
@jaimefrio - apart from that question, I think this is pretty much ready for a final review.
numpy/lib/function_base.py
I think this would be better implemented as:
if weights is not None and not (np.can_cast(weights.dtype, np.double) or
np.can_cast(weights.dtype, np.complex)):
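For reference, np.can_cast behaves as one would hope for the dtypes in question (a small illustration, not PR code; note that np.complex was an alias for the builtin complex in NumPy of that era):

```python
import numpy as np

# dtypes that cast safely to double can go through the weighted bincount path
print(np.can_cast(np.int32, np.double))        # True: int32 fits in a double
print(np.can_cast(np.float64, np.complex128))  # True: real embeds in complex
print(np.can_cast(np.complex128, np.double))   # False: would drop the imaginary part
```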
It mostly LGTM. I have quite a few questions/comments, but aside from the int-that-should-be-an-intp, there is nothing fundamentally wrong that I see.
@jaimefrio - I'll reply here about the blocks/chunks to make sure the comment stays here in the long term. I copied the use of the blocks from the older/slower method. It turns out that doing this results in a factor of 2 improvement in performance. Let's take the case of a 10^8 array. The times to do the histogram (with the new method in this PR) are:
So with
While slightly inelegant, I do think a 2x improvement in speed and 3x improvement in memory usage is worth it. In fact, I think that a 3x worsening in memory usage would be considered a performance regression by many, even if the code ran 5x faster. Do you agree that it makes sense to leave it as-is? I can add a comment about why this is done.
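The chunked fast path under discussion can be sketched roughly like this (my reconstruction from the conversation, not the PR's actual code; the BLOCK value is the one mentioned in-thread, and everything else is simplified, e.g. weights are omitted):

```python
import numpy as np

BLOCK = 65536  # block size discussed in this thread

def blocked_bincount_histogram(a, bins, limits):
    """Sketch of the fast path: equal-size bins let us compute bin
    indices by scaling, and chunking bounds the temporary memory."""
    mn, mx = (float(x) for x in limits)
    norm = bins / (mx - mn)
    n = np.zeros(bins, dtype=np.intp)
    for i in range(0, len(a), BLOCK):
        tmp_a = np.asarray(a[i:i + BLOCK], dtype=float)
        keep = (tmp_a >= mn)
        keep &= (tmp_a <= mx)
        tmp_a = tmp_a[keep]
        indices = ((tmp_a - mn) * norm).astype(np.intp)
        indices[indices == bins] -= 1  # values equal to mx belong in the last bin
        n += np.bincount(indices, minlength=bins).astype(np.intp)
    return n
```

With 10 equal bins over [0, 10] and values placed at bin centres, each bin counts its share exactly; the temporaries (`tmp_a`, `keep`, `indices`) never exceed BLOCK elements, which is the memory win discussed above.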
I'll buy that 2x. Thanks for taking the time to substantiate all my existential doubts with measurements.
Correct me if I am wrong, but it seems we both agree that, eventually, the right thing to do would be to rewrite at least this particular path in C, right?
If you could fix that int-that-should-be-an-intp, and get rid of some of the extraneous white space, I'll go ahead and merge it.
@jaimefrio - yes, I think in the long term we could benefit from switching over to C and I'd be happy to look into it (but that will need to wait a few weeks). I'll finish up the current pull request now so that you can go ahead and merge it. |
(force-pushed 5ab0ad5 to 4325078)
@jaimefrio - I've made the fixes you requested. I'd be interested in trying to tackle the switch to C - should I open an issue to keep track of it, or should I just open a pull request once I have something working?
@jaimefrio - I can also look into applying the same kind of optimization to the
This seems poorly chosen: this is one L1 cache's worth, and numpy likes its copies, so you always end up wiping it.
I get a 30% performance increase for N = 20000 when the block is 4 times smaller, though the same can be achieved by doing the keep computation in place.
Ah, it's not the cache (it is very hard to make use of L1 with numpy, unfortunately); it removes 200,000 page faults. 64k seems to trigger some different code path in glibc - not sure why yet, glibc usually only switches allocation mode at much larger values.
Just to clarify, I just used the value that was set for the original slow method, but as I indicated in #6100 (comment) for large arrays we'd actually benefit from a slightly larger block size.
Rather than spend too much time trying to find a one-size-fits-all block size, I would suggest leaving the block size for now since this is code that I plan to try and replace with a C iteration (as @jaimefrio and I were discussing).
However, I agree that the keep operation could be done in-place instead. I'll look at this now.
@juliantaylor - actually, could you clarify what you mean by doing the keep computation in place? I tried doing:
keep = tmp_a >= mn
np.logical_and(keep, tmp_a <= mx, out=keep)
but I don't see any improvements. Did you mean something else?
I also just checked with BLOCK=65536/4 for different array sizes from 1 to 1e8, but couldn't see any for which this was faster than the current block size.
this is more readable:
keep = (tmp_a >= mn)
keep &= (tmp_a <= mx)
I don't really understand what is going on, though my simple test case seems to trigger some bad memory allocation pattern. If I add another histogram call of another size, it disappears.
It still might be a good idea, as there is at least one case where it's significantly better and it just adds one line.
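To make the comparison concrete, here is a tiny sketch (not PR code) showing that the two spellings produce the same mask; the `&=` form just reuses the first comparison's array instead of allocating a third one:

```python
import numpy as np

tmp_a = np.array([0.5, 1.5, 2.5, 3.5])
mn, mx = 1.0, 3.0

# one-shot form: allocates a fresh array for the combined mask
keep_oneshot = (tmp_a >= mn) & (tmp_a <= mx)

# in-place form: the second comparison is ANDed into the existing array
keep = (tmp_a >= mn)
keep &= (tmp_a <= mx)

print(keep)  # [False  True  True False]
```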
Ok, done in 46b28f30539bacc19e803f4392fab95fc3c127db!
Six commits seem a few too many: if you can squash them into a single one, we can put it in.
(force-pushed 46b28f3 to 34b582a)
@jaimefrio - done!
I think the correct iteration limit is something like range(0, ((len(a) - 1) // BLOCK + 1) * BLOCK, BLOCK). With the current limit, an array whose size is a multiple of BLOCK will run an extra iteration on an empty array.
Also, I think using Python's range here is better than NumPy's arange, as it will be an iterator in Py3K, and in Python 2 building a list is typically faster than an array.
Should I change this for the 'original' method too? (I copied it from there)
@jaimefrio - are you sure about the extra iteration? If len(a) is a multiple of BLOCKSIZE, then because range is exclusive at the upper end, I think it does the right thing:
In [3]: np.arange(0, 30, 10)
Out[3]: array([ 0, 10, 20])
(but yes, we could simply change to Python's range)
Yes, and a[20:20+10] is the full last block.
edit: maybe the confusing thing is that Python slicing does not care about out-of-bounds indices.
@seberg - so you agree that the only thing that needs changing is arange -> range, but otherwise it's ok? (want to make sure I wait until there is a consensus otherwise if I amend the existing commit and force push it might get rid of this thread)
Sorry, would look like it to me, it was a bit of a stupid comment, misread the rest.
@jaimefrio - do you agree? If so, I will make the arange -> range change.
And that's why you don't review code after midnight... You, and whoever wrote the first version, are absolutely right, I was absolutely wrong. I won't make you push yet another change for the range/arange thing.
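For the record, the resolved point can be verified with toy sizes (the BLOCK value here is illustrative, not the PR's):

```python
# when len(a) is an exact multiple of BLOCK, range(0, len(a), BLOCK)
# yields len(a)//BLOCK start offsets and no empty trailing block
BLOCK = 10
a = list(range(30))
starts = list(range(0, len(a), BLOCK))
blocks = [a[i:i + BLOCK] for i in starts]
print(starts, [len(b) for b in blocks])  # [0, 10, 20] [10, 10, 10]
```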
ENH: Faster algorithm for computing histograms with equal-size bins
Thanks a lot, Thomas.
@jaimefrio - thanks for all your comments! Once I get a chance (reasonably soon, hopefully) I'll investigate some of the further improvements discussed here, including for example switching to C for part of the code.
mn -= 0.5
mx += 0.5
# At this point, if the weights are not integer, floating point, or
# complex, we have to use the slow algorithm.
What is the rationale for this? What type of weight would not work here?
I can't remember for sure, but I think e.g. object arrays work in the weights array:
In [7]: np.histogram([1,2,3], weights=np.array([Decimal(1), Decimal(2.2), Decimal(4.4)], dtype=np.object))
Out[7]:
(array([Decimal('1'), Decimal('0'), Decimal('0'), Decimal('0'),
Decimal('0'), Decimal('2.200000000000000177635683940'),
Decimal('0E-27'), Decimal('0E-27'), Decimal('0E-27'),
Decimal('4.400000000000000355271367880')], dtype=object),
array([ 1. , 1.2, 1.4, 1.6, 1.8, 2. , 2.2, 2.4, 2.6, 2.8, 3. ]))
Yep, that's it, thanks - you're guarding bincount, which only works for the types you list there.
About
When bins are equal-sized, we don't have to use searchsorted to find which bins values fall in; we can instead just scale and convert the values to bin indices and use bincount.
Results
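The core idea can be sketched in a few lines (illustration only; the real implementation also handles weights, chunking, and values exactly on the last edge):

```python
import numpy as np

# for equal-size bins the bin index is a linear function of the value,
# so a scale-and-truncate replaces the searchsorted lookup
a = np.array([0.1, 0.4, 0.4, 0.8])
bins, mn, mx = 4, 0.0, 1.0
indices = ((a - mn) * (bins / (mx - mn))).astype(np.intp)
counts = np.bincount(indices, minlength=bins)
print(counts)  # [1 2 0 1]
```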
I carried out speed tests with arrays going from 1 to 10^8 elements. The values were randomly sampled with:
and the histogram was computed with:
The speed difference is as follows:
The new method is always faster, and for more than 10^5 elements, the speedup is a factor of ~10x.
I then checked that the memory usage is not increased when binning an array with 10^9 values:
There is no memory penalty!
Note that I originally implemented this without the iteration over blocks, and not only did it use more memory, it was 2x as slow as the version with looping over chunks (to my surprise).
Future work
If this PR is merged, we can also consider switching to using C or Cython for the part where the bin indices are computed. This could provide a further ~3x improvement in performance.
Notes
This partially addresses #6099 (np.histogram could be made even faster by switching to C in places)
A similar speedup could be made for histogramdd.
Of course, given how extensively this function is used, it would be great if people could subject it to more rigorous testing (I will also do some more testing to see if I can find cases where there is a performance regression).