ENH: Adding a count parameter to np.unpackbits #10855

madphysicist · 2018-04-06T18:01:04Z

This PR addresses the fact that the recommended method for truncating undesired unpacked elements is currently to reshape and subset (e.g., this SO post). This seems wasteful, especially for large arrays. Adding a count argument allows the unpacking to use only the desired number of bits, requiring no extra memory.

The suggested parameter is pretty straightforward: it will be the final size of the output array along the desired axis. It will error out if you ask for more bits than there are in the data. Negative values indicate how many to trim (so for example, when unpacking a mask of width bits from a regular packbits invocation, use -(width % 8)), and None is the existing default behavior (use everything).

There is also a fix to the existing docs, which do not accurately reflect the name of the first parameter to [un]packbits correctly.

madphysicist · 2018-04-07T15:15:17Z

numpy/core/src/multiarray/compiled_base.c

Added the code comment so I could add the PR comment. Is there a reason we are not using memmove here?

numpy/core/src/multiarray/compiled_base.c

madphysicist · 2018-04-07T15:32:50Z

numpy/core/src/multiarray/compiled_base.c

There has got to be a better way than writing my own 2-bit converter, but oh well...

I think PyArray_SizeConverter might be useful, but it probably belongs in a separate MAINT PR, since it has applications elsewhere. I think there's already a function you can call on a PyObject * to convert to an npy_intp, which would be better to just use in the scope of this PR.

Thanks. Could you point me to that function? I wasn't able to find something that matched. PyArray_intpConverter seems to do something entirely different.

BTW, I've moved this out of WIP status since all the tests passed in my local environment.

Just use PyArray_PyIntAsIntp, which takes a PyObject * and returns a npy_intp. A converter might be nice, but it's not in scope for this PR.

Why is it out of scope? I would end up writing the same bit of code anyway, and stashing it where it wouldn't be accessible...

If PyArray_SizeConverter is useful, it's useful with or without this patch. Conversely, this patch is implementable with or without that converter. There's no reason to bundle the changes together into one review.

My main problem with PyArray_PyIntAsIntp is that it does not allow None. What if I defined the converter locally, without adding it to any public APIs, and later submitted a PR for it if someone else wanted it?

Making it a public API would certainly make more sense in a separate review.

Also, you could just use:

npy_intp count = 0; if (count_obj != Py_None) { count = PyArray_PyIntAsIntp(count_obj); if (error_converting(count)) { goto fail; } }

I think I like the idea of a similar converter, but I'd like to see it:

In a separate PR

With examples elsewhere in numpy showing it is more widely useful

We don't want to add something to the public API without being sure we chose the right definition. As you observe, PyArray_intpConverter doesn't do what you hoped it would.

Gotcha. Sorry for the slowness.

madphysicist · 2018-04-07T15:33:11Z

numpy/core/src/multiarray/conversion_utils.c

Please let me know if something like this is defined elsewhere...

madphysicist · 2018-04-08T03:34:49Z

numpy/core/code_generators/numpy_api.py

I'll be totally honest, I have no idea what I did here or why. The error messages made it pretty clear that the indices need to be unique, but I have no idea why or what they do. Should I have just set this one to 304 (the next available number)? If so, why aren't they all just being generated?

These numbers are not allowed to change, because that will break ABI compatibility for code compiled against previous versions of numpy - so your new number must go at the end,

Definitely will fix then.

What do the numbers actually do?

They are indices in a table of function pointers, which are used to export the functions to other python extension modules.

madphysicist · 2018-04-08T05:13:12Z

numpy/core/src/multiarray/compiled_base.c

This bit really needs some review. Is there a reason memcpy wasn't used for this before? If not, did I use the correct sizeof?

Is there any reason that memcpy should be used here? It's less safe, and the loop should optimize to it anyway.

I was just messing around. I'll go ahead and revert if you think that is better.

If the type of outdims is wrong, this does the wrong thing in all cases, where the previous code would warn and only fail in overflow cases.

Since it's not part of this new feature, and not a clear improvement to me, I'd prefer it not be part of an ENH PR.

The goal was to see if I could get rid of i.

eric-wieser · 2018-04-08T05:35:11Z

numpy/core/src/multiarray/compiled_base.c

* 8 would be much clearer here

Strangely enough, I decided to keep the existing convention for this bit. I rather prefer << 3 for some reason.

eric-wieser · 2018-04-08T05:36:29Z

numpy/core/src/multiarray/compiled_base.c

Why did this move? Better to leave things in the narrowest scope possible

Yes. Forgot to move it back when I realized I could optimize the stride=1 loop without using the mask. It's going back ASAP.

eric-wieser · 2018-04-08T05:42:37Z

numpy/core/src/multiarray/compiled_base.c

nit: space after for

eric-wieser · 2018-04-08T05:49:56Z

This seems wasteful, especially for large arrays

This wastes at most 7 bytes, and incurs no additional copies over your solution. Once you flatten an array, slicing followed by reshaping is O(1).

madphysicist · 2018-04-08T05:56:34Z

7 bytes * prod(rest_of_shape). My intended use case was actually unpacking very large PBM images, which are basically arrays packed along axis=1. Sure 35K isn't huge, but still not 7.

madphysicist · 2018-04-08T05:59:00Z

But yes, this is largely a convenience, just not purely so. I think that in an abstract sense, there is value to making pack/unpack be completely invertible.

eric-wieser · 2018-04-08T06:00:08Z

I don't think your converter default is correct - if I pass count=0, I would expect an empty array back, not a full one. I think that the None-or-intp concept you want to represent is not something converters can handle.

madphysicist · 2018-04-08T06:07:09Z

I was a bit conflicted about that too. Will change zero to give empty array. That makes the converter pointless, which is actually nice.

madphysicist · 2018-04-08T06:37:48Z

@eric-wieser. I belive that I've covered all your requests at this point (including the * 8 :) The tests are up to date, as are the docs.

madphysicist · 2018-04-08T06:39:43Z

None-or-intp can be handled by a converter that allocates an intp, in which case None maps directly to NULL. I would not do such a thing pretty much ever, though.

numpy/core/src/multiarray/compiled_base.c

madphysicist · 2018-04-08T07:00:38Z

Did I actually cause the Travis CI failure? Or is is something else?

seberg · 2018-04-08T07:58:11Z

Yes, its your code pretty sure. Might want to run it in valgrind or so to find the issue quickly. Since it is in the relaxed strides test, maybe you are using a stride somewhere and assuming contiguity? (i.e. assuming that stride[0] == stride[1] * shape[1])

EDIT: PS: would it make sense to be make this/allow a bit mask as well?

madphysicist · 2018-04-09T03:54:29Z

The bug turned out to be a bit more subtle than just assuming things about the strides. Basically, I forgot to check if there is a tail in the case where outstride==1. Normally, inptr is a valid value, but when instride is deliberately set to insanity, memcpy ends up accessing an invalid address, even with no bytes to copy.

madphysicist · 2018-04-09T03:57:59Z

@eric-wieser @seberg In addition to fixing the last lingering bug, I've made a couple of modifications that you may not see as "enhancements". I would be happy to remove either of the last two commits from this PR if you have a reason not to like them.

madphysicist · 2018-04-09T04:02:12Z

@seberg I am not sure I understand your PS:

EDIT: PS: would it make sense to be make this/allow a bit mask as well?

Are you referring to the output dtype or inputting some sort of mask to determine which elements to unpack? The former would be easy: allowing any integer type or boolean for the unpacked result seems reasonable. Might even be easy with just a stride adjustment and an offset to determine if the output is big or little endian. The latter seems like overkill for this PR. That being said, I suspect you had neither of those options in mind...

eric-wieser · 2018-04-10T07:55:25Z

Why is short-circuiting for size-0 worthwhile? You're making a case that almost never happens marginally faster, at the expense of a marginal slowdown for every other case, and increased complexity to the reader.

I'd prefer it without that case.

Special cases aren't special enough to break the rules.

eric-wieser · 2018-04-10T07:57:15Z

numpy/add_newdocs.py

Not sure the negative number is a useful feature here

eric-wieser · 2018-04-10T07:57:41Z

numpy/add_newdocs.py

Padding with zeros would be more useful here, I think?

madphysicist · 2018-04-10T14:03:30Z

@eric-wieser. I agree with your comments (and made the appropriate fixes), except the negative count. I wanted it to be consistent with the usual indexing scheme, where negatives have the same meaning.

madphysicist · 2019-01-22T22:17:02Z

I believe that this PR is ready to go now. I had to compute the threading threshold separately in order to avoid divide-by-zero errors. I hope that I did it in a way that you can agree is fairly painless.

numpy/core/src/multiarray/compiled_base.c

seberg · 2019-01-29T20:31:49Z

Sorry this has been hanging a bit, was a bit busy and sick. I think I will make another quick pass over soon, but should be ready (although I am not sure I read the tests very carefully before).

There is still a compiler warning that needs fixing for the tests to pass currently.

madphysicist · 2019-01-29T20:37:55Z

@seberg Thanks for clarifying that. I have been sick and busy myself, so gave up quickly on searching for the bug in the code (that I couldn't seem to reproduce). Good to know it's just a casting thing. I'll have that fixed ASAP.

madphysicist · 2019-02-07T20:40:30Z

@seberg ASAP wasn't S at all, but it looks like everything is OK now.

doc/release/1.17.0-notes.rst

numpy/core/multiarray.py

madphysicist · 2019-02-07T23:23:06Z

numpy/core/src/multiarray/compiled_base.c

@seberg. As you can see here, myarray was never an option to begin with. The documented keyword name never actually what the function was parsing, and the name being parsed was an illegal name. Although I suppose you could do

kw = {'in': ...} unpackbits(**kw)

Fair enough. a is better anyway. I wonder if we should at some point work on unifying most of these names (basically a backward incompatible change that would in practice probably hit almost no-one).

doc/release/1.17.0-notes.rst

madphysicist · 2019-02-08T19:51:55Z

@seberg I think this is ready to go now.

seberg · 2019-02-14T19:28:09Z

Yes, looks ready. I will put it in a day or two unless something comes up. @mattip this will probably cause merge conflicts with your order addition.

mattip · 2019-02-17T09:35:49Z

numpy/core/multiarray.py

Single quotes are a shortcut for :py:value and are meant to be linked to python documentation. When creating the link fails, the text is rendered in italics. Since the intention here is for code formatting, they should all be two quotes. You can see the difference in the way the default axis argument None used to be rendered here vs this PR here

@mattip. Most of the docs I see have single quotes. I prefer the aesthetic of the italicized version to the code formatted one, and technically None is a python object. However, I don't care much one way or the other, so will fix as you request ASAP.

It is strange that with single quotes, it wouldn't link to the python documentation of None already. Indeed, that seems a bug...

I think I slightly prefer the double backticks for None (there is no point in linking it to the python docs anyway).

https://docs.scipy.org/doc/numpy/docs/howto_document.html has some examples using single ticks for the kwargs, and actually nothing for None. Could be nice to just decide for once, not sure I care though.

I'll grep and count both versions in the existing code. My intuition tells me that single ticks dominate, but I'm not sure. Majority rules? :)

None is just an example here. There are other single-ticked words in the documentation change: a, count, index,

Yeah, although for a, count and axis at least, they are arguments, and the howto document examples use single backticks for those. However, while I would guess a lot of people somewhat use that document, I doubt it is applied consistently with respect to these examples in any case.

Let's call it a niggling nit and ignore it, is this ready for merge otherwise?

There were 145 occurrences of None with single quotes and 147 occurrences with double quotes until this PR came along. I will change back to double. Sorry for the holdup.

@mattip If you want me to go through and normalize all the other occurences, I can do that fairly quickly, since I have the full list and a script that can do it right in front of me.

mattip · 2019-02-25T14:56:49Z

@seberg is there more to do here?

Tests are included. Couple of minor fixes: - Fixed packbits/unpackbits docs to reflect proper keyword names - Added .pytest_cache to .gitignore

seberg · 2019-02-25T15:28:22Z

No, don't think so. Thanks @madphysicist, sorry it took so long. Thanks Matti for the reminder. Going to wait for the test build, were there any changes in that last push?

madphysicist · 2019-02-25T15:29:41Z

@seberg Nothing besides some backticks around None and a rebase onto the latest master branch.

madphysicist · 2019-02-25T18:58:53Z

@seberg Looks ready to go

seberg · 2019-02-25T19:48:31Z

OK, thanks @madphysicist and everyone!

madphysicist commented Apr 7, 2018

View reviewed changes

madphysicist commented Apr 8, 2018

View reviewed changes

madphysicist force-pushed the unpack-count branch from e385035 to 4931d2f Compare April 8, 2018 05:09

madphysicist commented Apr 8, 2018

View reviewed changes

madphysicist changed the title ~~WIP: ENH: Adding a count parameter to np.unpack~~ ENH: Adding a count parameter to np.unpack Apr 8, 2018

eric-wieser reviewed Apr 8, 2018

View reviewed changes

numpy/core/src/multiarray/compiled_base.c Outdated

Copy link

Member

eric-wieser Apr 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: space after for

eric-wieser changed the title ~~ENH: Adding a count parameter to np.unpack~~ ENH: Adding a count parameter to np.unpack_bits Apr 8, 2018

eric-wieser changed the title ~~ENH: Adding a count parameter to np.unpack_bits~~ ENH: Adding a count parameter to np.unpackbits Apr 8, 2018

madphysicist commented Apr 8, 2018

View reviewed changes

numpy/core/src/multiarray/compiled_base.c Outdated Show resolved Hide resolved

charris added 01 - Enhancement component: numpy._core labels Apr 9, 2018

eric-wieser reviewed Apr 10, 2018

View reviewed changes

numpy/add_newdocs.py Outdated

Copy link

Member

eric-wieser Apr 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure the negative number is a useful feature here

eric-wieser reviewed Apr 10, 2018

View reviewed changes

numpy/add_newdocs.py Outdated

Copy link

Member

eric-wieser Apr 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Padding with zeros would be more useful here, I think?

madphysicist reacted with thumbs up emoji

madphysicist force-pushed the unpack-count branch 3 times, most recently from 174852a to 8aedcc0 Compare January 16, 2019 00:34

madphysicist mentioned this pull request Jan 16, 2019

DOC: Missing -std=c99 flags somewhere? #12766

Closed

madphysicist force-pushed the unpack-count branch from 8aedcc0 to 61b2dc7 Compare January 22, 2019 19:54

seberg reviewed Jan 23, 2019

View reviewed changes

numpy/core/src/multiarray/compiled_base.c Outdated Show resolved Hide resolved

madphysicist force-pushed the unpack-count branch from 61b2dc7 to 4b53d4e Compare January 24, 2019 18:12

seberg self-assigned this Jan 29, 2019

madphysicist force-pushed the unpack-count branch from 4b53d4e to 86b7c27 Compare February 7, 2019 20:10

seberg reviewed Feb 7, 2019

View reviewed changes

doc/release/1.17.0-notes.rst Outdated Show resolved Hide resolved

seberg reviewed Feb 7, 2019

View reviewed changes

numpy/core/multiarray.py Outdated Show resolved Hide resolved

madphysicist commented Feb 8, 2019

View reviewed changes

madphysicist force-pushed the unpack-count branch from 86b7c27 to b32a818 Compare February 8, 2019 19:21

mattip reviewed Feb 17, 2019

View reviewed changes

mattip mentioned this pull request Feb 25, 2019

ENH: Add 'bitorder' keyword to packbits, unpackbits #12962

Merged

ENH: Added count parameter to unpackbits

df00dbf

Tests are included. Couple of minor fixes: - Fixed packbits/unpackbits docs to reflect proper keyword names - Added .pytest_cache to .gitignore

madphysicist force-pushed the unpack-count branch from b32a818 to df00dbf Compare February 25, 2019 15:22

seberg merged commit 269d985 into numpy:master Feb 25, 2019

madphysicist deleted the unpack-count branch February 25, 2019 20:50

Uh oh!

ENH: Adding a count parameter to np.unpackbits #10855

ENH: Adding a count parameter to np.unpackbits #10855

Uh oh!

Conversation

madphysicist commented Apr 6, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eric-wieser Apr 8, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eric-wieser Apr 8, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

madphysicist Apr 8, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eric-wieser commented Apr 8, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

madphysicist commented Apr 8, 2018

Uh oh!

madphysicist commented Apr 8, 2018

Uh oh!

eric-wieser commented Apr 8, 2018

Uh oh!

madphysicist commented Apr 8, 2018

Uh oh!

madphysicist commented Apr 6, 2018 •

edited

Loading

eric-wieser Apr 8, 2018 •

edited

Loading

eric-wieser Apr 8, 2018 •

edited

Loading

madphysicist Apr 8, 2018 •

edited

Loading

eric-wieser commented Apr 8, 2018 •

edited

Loading

seberg commented Apr 8, 2018 •

edited

Loading

eric-wieser commented Apr 10, 2018 •

edited

Loading