BUG: Provide correct format in Py_buffer for scalars#10564
BUG: Provide correct format in Py_buffer for scalars#10564eric-wieser merged 10 commits intonumpy:masterfrom
Conversation
|
|
||
| # TODO: 'p', 'l', or 'q'? | ||
| # (np.intp, 'p'), | ||
| # (np.uintp, 'P'), |
There was a problem hiding this comment.
p won't be possible, because intp is just an alias for one of the other integer types - they're not distinguishable
There was a problem hiding this comment.
This doesn't report the endianness correctly. Note that as soon as you add endianness markers, you also have to use standard length codes, rather than the native ones (use the ones based on sizeof not the type name).
This also seems to reimplement stuff that I we already have for ndarray. Can you work out some way to share that code? There's a dtype to PEP3118 converter somewhere, probably _internal.py
Also, tests for structured void scalars would be good.
|
Is there a way create a scalar that isn't native endianess? |
|
Nice catch, maybe there isn't. |
|
|
||
| outcode = PyArray_DescrFromScalar(self); | ||
|
|
||
| if (flags & PyBUF_FORMAT && 3 <= sizeof view->internal ) { |
There was a problem hiding this comment.
Parens around (a & b) would be nice.
Are you sure sizeof(view->internal) does anything useful here?
There was a problem hiding this comment.
Currently pointless, but the compiler will optimize that away.
I was trying to guard against the scenario were code strings are 5 bytes long, and then it works on the 64bit build but not the 32bit build. Having an explicit call to sizeof was meant to be a reminder to the programmer not to go past their limit. That might be unnecessary/ineffective defensive programming.
It would probably be better to do memset(&view->internal, '\0', sizeof view->internal). That might get the same idea across and zero out the everything instead of the first 3 bytes.
| (np.float32, 4), | ||
| (np.float64, 8), | ||
| (np.complex64, 8), | ||
| (np.complex128, 16), |
There was a problem hiding this comment.
I think all of these would be better as np.dtype(type).itemsize, to avoid duplication. You could then use longdouble instead of float128 below, which is always defined.
There was a problem hiding this comment.
would type().itemsize be better than np.dtype(type).itemsize? Not sure if there is an appreciable difference.
| a = np.array([], dtype=scalar) | ||
| assert_(x.data.format == a.data.format) | ||
|
|
||
| @dec.skipif(sys.version_info.major < 3, "scalars do not implement buffer interface in Python 2") |
There was a problem hiding this comment.
Not sure this is true - can you use memoryview(x) to get hold of it?
There was a problem hiding this comment.
Oddly, no.
Python 2.7.14 (default, Sep 23 2017, 22:06:14)
[GCC 7.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> x = np.int16(2)
>>> memoryview(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: cannot make memory view because object does not have the buffer interface
>>> x.data
<read-only buffer for 0x7ff23b84d150, size -1, offset 0 at 0x7ff23b778430>
>>>
There was a problem hiding this comment.
Hmm, that's pretty odd, given it works fine on arrays
|
Turns out you can change endianess of scalars PEP3118 format string generation is in buffer.c. It uses in internal struct I'll look into unifying the format string generation now that I know non native endianness and standard size are possibilities. |
|
That's not a correct example: >>> np.int16(1).dtype.byteorder
'='
>>> np.int16(1).newbyteorder('>').dtype.byteorder
'='In both cases the scalar has native endianness - you're swapping the bytes, but doing so by changing the memory, not the interpretation. Note that on arrays |
|
Something to watch out for - the layout in memory of |
|
@eric-wieser This should cover the fixes you requested. format strings from scalar and ndarrays are constructed with the same code |
numpy/core/src/multiarray/buffer.c
Outdated
| err = _buffer_format_string(descr, &fmt, obj, NULL, NULL); | ||
| if(PyArray_IsScalar(obj, Generic)) { | ||
| Py_DECREF(descr); | ||
| } |
There was a problem hiding this comment.
Would be tidier to just add Py_INCREF(descr) above when you obtain this, so you can do this unconditionally
| if (descr->type_num == NPY_UNICODE) { | ||
| elsize >>= 1; | ||
| } | ||
| #endif |
There was a problem hiding this comment.
Pleased to see that you did this
eric-wieser
left a comment
There was a problem hiding this comment.
This looks pretty good - just some minor comments
| np.half, np.single, np.double, np.float_, np.longfloat, | ||
| np.float16, np.float32, np.float64, np.csingle, np.complex_, | ||
| np.clongfloat, np.complex64, np.complex128, | ||
| ] |
There was a problem hiding this comment.
This list is longer than it needs to be, as most of these are aliases. I'd recommend testing only:
[
np.bool,
np.byte, np.short, np.intc, np.int_, np.longlong,
np.ubyte, np.ushort, np.uintc, np.uint, npu.longlong,
np.half, np.single, np.double, np.longdouble,
np.csingle, np.cdouble, np.clongdouble,
]| # platform dependant dtypes | ||
| for dtype in ('float96', 'float128', 'complex192', 'complex256'): | ||
| if hasattr(np, dtype): | ||
| scalars.append(getattr(np, dtype)) |
There was a problem hiding this comment.
These are also only aliases, so there's no need to include them over the np.longdouble above.
|
|
||
|
|
||
| class TestScalarPEP3118(object): | ||
| @dec.skipif(sys.version_info.major < 3, "scalars do not implement buffer interface in Python 2") |
There was a problem hiding this comment.
I'm curious - can you do
skip_if_no_buffer_interface = dec.skipif(sys.version_info.major < 3, "scalars do not implement buffer interface in Python 2")
and then apply @skip_if_no_buffer_interface to each one? Or does it need a new instance each time?
| @dec.skipif(sys.version_info.major < 3, "scalars do not implement buffer interface in Python 2") | ||
| def test_void_scalar_structured_data(self): | ||
| dt = np.dtype([('name', np.unicode_, 16), ('grades', np.float64, (2,))]) | ||
| a = np.array([('Sarah', (8.0, 7.0)), ('John', (6.0, 7.0))], dtype=dt) |
There was a problem hiding this comment.
You might want to consider making this an 0d array, as np.array(('Sarah', (8.0, 7.0)), ('John', (6.0, 7.0)), dtype=dt)
There was a problem hiding this comment.
Not sure I follow, how would a 0d array help?
I'm trying to get an np.void object to test. Ideally I would directly make an np.void scalar, but I don't see how.
np.array(('Sarah', (8.0, 7.0)), dtype=dt) ndim is 0, but type is ndarray
There was a problem hiding this comment.
You can convert an 0d array to a scalar with [()], in place of [0]`.
You could then test that the buffer returned from array vs scalar has the same strides, shape, and format
There was a problem hiding this comment.
I would prefer to just change to:
x = np.array(('ndarray_scalar', (1.2, 3.0)), dtype=dt)[()]This way
xis definitely different thana, not just an element of that array.- Its as close to directly creating a structured
np.voidas I can get. - ndim/shape/strides/suboffsets are tested against explicit values per PEP3118, not just that they match corresponding buffer fields from ndarray
a
As a trade off, the format string of the scalar is tested against ndarray format string. Its simpler then testing against the multitude of different, but still valid, PEP3118 format strings for this structure.
|
|
||
| mv_a = memoryview(a) | ||
| assert_(mv_x.itemsize == mv_a.itemsize) | ||
| assert_(mv_x.format == mv_a.format) |
There was a problem hiding this comment.
Use assert_equal here and above to give a better error message
| for scalar, code in scalars_set_code: | ||
| x = scalar() | ||
| mv_x = memoryview(x) | ||
| assert_(mv_x.format == code) |
There was a problem hiding this comment.
This test makes me uneasy due to the difference between "standard" and "native" letter codes, but I suppose my concern is equally valid about the memoryview of arrrays - so I'm happy to keep it until we find a problem with it in a separate PR.
| assert_(isinstance(x, np.void)) | ||
|
|
||
| mv_x = memoryview(x) | ||
| expected_size = 16 * np.unicode_().__array__().itemsize |
There was a problem hiding this comment.
How about np.dtype((np.unicode, 1)).itemsize? What you have here might change to 0 in future.
|
|
||
| mv_x = memoryview(x) | ||
| expected_size = 16 * np.unicode_().__array__().itemsize | ||
| expected_size += 2 * np.float64().__array__().itemsize |
There was a problem hiding this comment.
np.dtype(np.float64).itemsize
numpy/core/src/multiarray/buffer.c
Outdated
| if (descr->byteorder == '=' && | ||
| _is_natively_aligned_at(descr, arr, *offset)) { | ||
| if(!PyArray_IsScalar(obj, Generic)) { | ||
| /* only check ndarrays, scalars are always natively aligned */ |
There was a problem hiding this comment.
This would be clearer if you assigned the is_natively_aligne = 1 in the else, and moved the comment there (or swapped the if and else
|
CI failure can be ignored, but will go away if you rebase on master |
numpy/core/src/multiarray/buffer.c
Outdated
| } | ||
| else { | ||
| int is_standard_size = 1; | ||
| int is_natively_aligned = 1; |
There was a problem hiding this comment.
No point initializing this any more
| np.csingle, np.cdouble, np.clongdouble, | ||
| ] | ||
|
|
||
| scalars_set_code = [ |
There was a problem hiding this comment.
Could do with a comment like
# PEP3118 format strings for native (standard alignment and byteorder) types
numpy/core/src/multiarray/buffer.c
Outdated
| { | ||
| _buffer_info_t *info = NULL; | ||
| PyArray_Descr *descr = NULL; | ||
| int elsize = 0; |
There was a problem hiding this comment.
No point initializing any of these either, IMO
eric-wieser
left a comment
There was a problem hiding this comment.
Two nits that my opinion might be worth ignoring on, and a request for a comment in the test.
numpy/core/src/multiarray/buffer.c
Outdated
|
|
||
| if (descr->byteorder == '=' && | ||
| _is_natively_aligned_at(descr, arr, *offset)) { | ||
| if(PyArray_IsScalar(obj, Generic)) { |
There was a problem hiding this comment.
super-nit: missing space after if
| (np.half, 'e'), | ||
| (np.single, 'f'), | ||
| (np.double, 'd'), | ||
| (np.float_, 'd'), |
There was a problem hiding this comment.
These two are aliases, so there's no point having both - just use np.double, and ditch np.float_
| (np.single, 'f'), | ||
| (np.double, 'd'), | ||
| (np.float_, 'd'), | ||
| (np.longfloat, 'g'), |
There was a problem hiding this comment.
This would normally be spelt np.longdouble
| (np.longfloat, 'g'), | ||
| (np.csingle, 'Zf'), | ||
| (np.complex_, 'Zd'), | ||
| (np.clongfloat, 'Zg'), |
There was a problem hiding this comment.
np.cdouble and np.clongdouble would be more consistent here.
| ] | ||
|
|
||
| # PEP3118 format strings for native (standard alignment and byteorder) types | ||
| scalars_set_code = [ |
There was a problem hiding this comment.
scalars_and_codes would be a better name
| from numpy.testing import run_module_suite, assert_, assert_equal, dec | ||
|
|
||
| # types | ||
| scalars = [ |
There was a problem hiding this comment.
I think you can ditch this, and just use
for scalar, _ in scalar_and_codes:
...in your tests below
|
Updated. |
| assert_(isinstance(x, np.void)) | ||
| mv_x = memoryview(x) | ||
| expected_size = 16 * np.dtype((np.unicode_, 1)).itemsize | ||
| expected_size += 2 * np.dtype((np.float64, 1)).itemsize |
There was a problem hiding this comment.
Deliberately not using expected_size = dt.itemsize?
There was a problem hiding this comment.
Yes. I'd like to test against an explicit values, rather than trusting that dt.itemsize is correct.
I do use dtype.itemsize of unicode_ and float64 when calculating expected_size because of platform dependant sizes of unicode_. But this is still one step more explicit than testing scalar void.itemsize == dtype void.itemsize
There was a problem hiding this comment.
The assertion memoryview of void itemsize, not scalar void itemsize - arguably all we care about is that the the itemsize of a memoryview and dtype match.
I don't feel too strongly about this though.
There was a problem hiding this comment.
because of platform dependant sizes of unicode_
Note that unicode as a dtype is always 4 byte-characters - you're thinking of unicode_ scalars, which are sometimes 2-byte, due to subclassing the builtin unicode.
|
Thanks for a great first contribution @vanossj! |
Fixes #10265
memory view of scalars will report correct format code as well as ndim==0 and empty shape/strides/suboffsets
Previous behaviour had memoryview of all scalars be an array of bytes.
Format code is stored in
Py_buffer.internal. This was a simple place to store that didn't require an allocation of memory. This does somewhat duplicate code in buffer.c. However scalars are a subset of dtypes, that are always native size and aligned and only a single element, and so the resulting code could be simplified to a simple switch statement.Tests check that scalar format code matches the format code of an empty
arraywith dtype matching the scalar.dtypes
intpanduintpare seem to be analogous to c pointers and according to PEP3118 should report a format code of 'p' or 'P' respectively. Howeverarrays of this dtype use the integer format code. This patch follows that precedent.