BUG: Provide correct format in Py_buffer for scalars #10564

Merged
eric-wieser merged 10 commits into numpy:master from
vanossj:fix-pep3118-scalar-types
Mar 25, 2018

Conversation

@vanossj
Contributor

@vanossj vanossj commented Feb 9, 2018

Fixes #10265

A memoryview of a scalar now reports the correct format code, as well as ndim == 0 and empty shape/strides/suboffsets.

Previously, a memoryview of any scalar appeared as an array of bytes.

The format code is stored in Py_buffer.internal; this was a simple place to store it that didn't require allocating memory. This does somewhat duplicate code in buffer.c. However, scalars are the subset of dtypes that are always native-size, aligned, and single-element, so the resulting code simplifies to a plain switch statement.

Tests check that a scalar's format code matches the format code of an empty array whose dtype matches the scalar.

The dtypes intp and uintp seem to be analogous to C pointers and, according to PEP 3118, should report format codes 'p' and 'P' respectively. However, arrays of these dtypes use the integer format codes; this patch follows that precedent.
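The behaviour described above can be sketched like this (a minimal illustration of the PR's effect on Python 3, not code from the patch itself):

```python
import numpy as np

x = np.float64(1.5)
mv = memoryview(x)

# With this fix, a scalar buffer looks like a 0-d view of its own dtype,
# not an array of bytes
assert mv.ndim == 0
assert mv.shape == ()
assert mv.itemsize == 8

# The scalar's format code matches that of an array of the same dtype,
# which is what the PR's tests check
assert mv.format == memoryview(np.array(x)).format
```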


# TODO: 'p', 'l', or 'q'?
# (np.intp, 'p'),
# (np.uintp, 'P'),
Member

@eric-wieser eric-wieser Feb 10, 2018

p won't be possible, because intp is just an alias for one of the other integer types - they're not distinguishable
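The aliasing is easy to confirm (an illustrative check, not part of the patch):

```python
import numpy as np

# intp is pointer-sized, but it is only an alias of one of the fixed-width
# integer types, so it is indistinguishable from that type in a buffer
assert np.dtype(np.intp) in (np.dtype(np.int32), np.dtype(np.int64))

# Arrays of intp therefore report an integer format code, never 'p'
fmt = memoryview(np.zeros(1, dtype=np.intp)).format
assert fmt in ('i', 'l', 'q')
```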

Member

@eric-wieser eric-wieser left a comment

This doesn't report the endianness correctly. Note that as soon as you add endianness markers, you also have to use standard length codes rather than the native ones (use the codes based on sizeof, not the type name).

This also seems to reimplement things we already have for ndarray. Can you work out some way to share that code? There's a dtype-to-PEP3118 converter somewhere, probably in _internal.py.

Also, tests for structured void scalars would be good.
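The standard-vs-native size distinction the review raises can be seen with Python's struct mini-language, which PEP 3118 extends (illustrative only):

```python
import struct

# With no byte-order marker ('@' native mode), codes use platform sizes;
# any explicit marker ('=', '<', '>') switches to fixed "standard" sizes,
# so adding an endianness marker changes what a letter like 'l' means
assert struct.calcsize('=l') == 4   # standard long: always 4 bytes
assert struct.calcsize('=q') == 8   # standard long long: always 8 bytes

# native long may be 4 or 8 bytes depending on the platform
assert struct.calcsize('@l') in (4, 8)
```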

@vanossj
Contributor Author

vanossj commented Feb 10, 2018

Is there a way to create a scalar that isn't native endianness?

@eric-wieser
Member

Nice catch, maybe there isn't.


outcode = PyArray_DescrFromScalar(self);

if (flags & PyBUF_FORMAT && 3 <= sizeof view->internal ) {
Member

@eric-wieser eric-wieser Feb 10, 2018

Parens around (a & b) would be nice.

Are you sure sizeof(view->internal) does anything useful here?

Contributor Author

Currently pointless, but the compiler will optimize that away.

I was trying to guard against the scenario where format code strings are 5 bytes long, and then it works on the 64-bit build but not the 32-bit build. Having an explicit call to sizeof was meant as a reminder to the programmer not to go past the limit. That might be unnecessary/ineffective defensive programming.

It would probably be better to do memset(&view->internal, '\0', sizeof view->internal). That might get the same idea across and zero out everything instead of just the first 3 bytes.

(np.float32, 4),
(np.float64, 8),
(np.complex64, 8),
(np.complex128, 16),
Member

I think all of these would be better as np.dtype(type).itemsize, to avoid duplication. You could then use longdouble instead of float128 below, which is always defined.

Contributor Author

Would type().itemsize be better than np.dtype(type).itemsize? Not sure if there is an appreciable difference.
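For what it's worth, the two spellings agree for fixed-width types (a quick check, not code from the PR):

```python
import numpy as np

# A scalar instance reports the same itemsize as its dtype
assert np.float64().itemsize == np.dtype(np.float64).itemsize == 8
assert np.int16().itemsize == np.dtype(np.int16).itemsize == 2
```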

a = np.array([], dtype=scalar)
assert_(x.data.format == a.data.format)

@dec.skipif(sys.version_info.major < 3, "scalars do not implement buffer interface in Python 2")
Member

Not sure this is true - can you use memoryview(x) to get hold of it?

Contributor Author

Oddly, no.

Python 2.7.14 (default, Sep 23 2017, 22:06:14) 
[GCC 7.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> x = np.int16(2)
>>> memoryview(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: cannot make memory view because object does not have the buffer interface
>>> x.data
<read-only buffer for 0x7ff23b84d150, size -1, offset 0 at 0x7ff23b778430>
>>> 

Member

Hmm, that's pretty odd, given it works fine on arrays

@vanossj
Contributor Author

vanossj commented Feb 10, 2018

Turns out you can change the endianness of scalars:
np.int16(1).newbyteorder('S')

PEP 3118 format string generation is in buffer.c. It uses an internal struct _tmp_string_t and allocates memory, something I was hoping to avoid.

I'll look into unifying the format string generation now that I know non-native endianness and standard size are possibilities.

@eric-wieser
Member

eric-wieser commented Feb 10, 2018

That's not a correct example:

>>> np.int16(1).dtype.byteorder
'='
>>> np.int16(1).newbyteorder('>').dtype.byteorder
'='

In both cases the scalar has native endianness - you're swapping the bytes, but doing so by changing the memory, not the interpretation. .byteswap() does the same.

Note that on arrays .byteswap() swaps the bytes, but newbyteorder('S') changes the interpretation. On scalars, it seems that both mean byteswap().
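The array-side distinction can be sketched like this (using dtype.newbyteorder with a view, which keeps the example portable across NumPy versions):

```python
import numpy as np

a = np.array([1], dtype='<i2')                # bytes in memory: 01 00

swapped = a.byteswap()                        # changes the memory, keeps '<i2'
reinterp = a.view(a.dtype.newbyteorder('S'))  # keeps memory, dtype becomes '>i2'

assert swapped.dtype == a.dtype               # still little-endian dtype
assert reinterp.dtype.byteorder == '>'        # big-endian interpretation

# Either way the value reads back as 0x0100 == 256
assert int(swapped[0]) == int(reinterp[0]) == 256
```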

@eric-wieser
Member

_buffer_format_string is the function to look at.

Something to watch out for - the layout in memory of np.unicode_ objects on windows is not the same as the layout of arrays of these objects - one is UCS2, the other is UCS4.

@vanossj
Contributor Author

vanossj commented Feb 23, 2018

@eric-wieser This should cover the fixes you requested. Format strings for scalars and ndarrays are now constructed with the same code.

err = _buffer_format_string(descr, &fmt, obj, NULL, NULL);
if(PyArray_IsScalar(obj, Generic)) {
Py_DECREF(descr);
}
Member

Would be tidier to just add Py_INCREF(descr) above when you obtain this, so you can do this unconditionally

if (descr->type_num == NPY_UNICODE) {
elsize >>= 1;
}
#endif
Member

@eric-wieser eric-wieser Feb 27, 2018

Pleased to see that you did this

Member

@eric-wieser eric-wieser left a comment

This looks pretty good - just some minor comments

np.half, np.single, np.double, np.float_, np.longfloat,
np.float16, np.float32, np.float64, np.csingle, np.complex_,
np.clongfloat, np.complex64, np.complex128,
]
Member

This list is longer than it needs to be, as most of these are aliases. I'd recommend testing only:

[
    np.bool,
    np.byte, np.short, np.intc, np.int_, np.longlong,
    np.ubyte, np.ushort, np.uintc, np.uint, np.ulonglong,
    np.half, np.single, np.double, np.longdouble,
    np.csingle, np.cdouble, np.clongdouble,
]

# platform dependent dtypes
for dtype in ('float96', 'float128', 'complex192', 'complex256'):
if hasattr(np, dtype):
scalars.append(getattr(np, dtype))
Member

These are also only aliases, so there's no need to include them over the np.longdouble above.
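The aliasing claim is easy to verify (an illustrative check, not code from the PR):

```python
import numpy as np

# double/float64 are the same type object, and the size-named extended
# floats are aliases of longdouble wherever they exist
assert np.double is np.float64
for name in ('float96', 'float128'):
    if hasattr(np, name):
        assert np.dtype(getattr(np, name)) == np.dtype(np.longdouble)
```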



class TestScalarPEP3118(object):
@dec.skipif(sys.version_info.major < 3, "scalars do not implement buffer interface in Python 2")
Member

I'm curious - can you do

skip_if_no_buffer_interface = dec.skipif(sys.version_info.major < 3, "scalars do not implement buffer interface in Python 2")

and then apply @skip_if_no_buffer_interface to each one? Or does it need a new instance each time?

@dec.skipif(sys.version_info.major < 3, "scalars do not implement buffer interface in Python 2")
def test_void_scalar_structured_data(self):
dt = np.dtype([('name', np.unicode_, 16), ('grades', np.float64, (2,))])
a = np.array([('Sarah', (8.0, 7.0)), ('John', (6.0, 7.0))], dtype=dt)
Member

You might want to consider making this a 0d array, as np.array(('Sarah', (8.0, 7.0)), ('John', (6.0, 7.0)), dtype=dt)

Contributor Author

Not sure I follow, how would a 0d array help?

I'm trying to get an np.void object to test. Ideally I would directly make an np.void scalar, but I don't see how.

np.array(('Sarah', (8.0, 7.0)), dtype=dt) has ndim 0, but its type is ndarray.

Member

You can convert a 0d array to a scalar with [()] in place of [0].

You could then test that the buffer returned from the array vs the scalar has the same strides, shape, and format.
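A sketch of the [()] trick (np.str_ is used here as the current spelling of np.unicode_):

```python
import numpy as np

dt = np.dtype([('name', np.str_, 16), ('grades', np.float64, (2,))])
a0 = np.array(('Sarah', (8.0, 7.0)), dtype=dt)  # 0-d structured array

x = a0[()]  # indexing a 0-d array with () yields the scalar
assert a0.ndim == 0
assert isinstance(a0, np.ndarray)
assert isinstance(x, np.void)
```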

Contributor Author

I would prefer to just change to:

x = np.array(('ndarray_scalar', (1.2, 3.0)), dtype=dt)[()]

This way:

  • x is definitely different from a, not just an element of that array.
  • It's as close to directly creating a structured np.void as I can get.
  • ndim/shape/strides/suboffsets are tested against explicit values per PEP 3118, not just that they match the corresponding buffer fields from ndarray a.

As a trade-off, the format string of the scalar is tested against the ndarray format string. It's simpler than testing against the multitude of different, but still valid, PEP 3118 format strings for this structure.


mv_a = memoryview(a)
assert_(mv_x.itemsize == mv_a.itemsize)
assert_(mv_x.format == mv_a.format)
Member

Use assert_equal here and above to give a better error message

for scalar, code in scalars_set_code:
x = scalar()
mv_x = memoryview(x)
assert_(mv_x.format == code)
Member

This test makes me uneasy due to the difference between "standard" and "native" letter codes, but I suppose my concern is equally valid about the memoryview of arrays - so I'm happy to keep it until we find a problem with it in a separate PR.

assert_(isinstance(x, np.void))

mv_x = memoryview(x)
expected_size = 16 * np.unicode_().__array__().itemsize
Member

How about np.dtype((np.unicode, 1)).itemsize? What you have here might change to 0 in future.


mv_x = memoryview(x)
expected_size = 16 * np.unicode_().__array__().itemsize
expected_size += 2 * np.float64().__array__().itemsize
Member

np.dtype(np.float64).itemsize

if (descr->byteorder == '=' &&
_is_natively_aligned_at(descr, arr, *offset)) {
if(!PyArray_IsScalar(obj, Generic)) {
/* only check ndarrays, scalars are always natively aligned */
Member

This would be clearer if you assigned is_natively_aligned = 1 in the else, and moved the comment there (or swapped the if and else).

@eric-wieser
Member

CI failure can be ignored, but will go away if you rebase on master

}
else {
int is_standard_size = 1;
int is_natively_aligned = 1;
Member

No point initializing this any more

np.csingle, np.cdouble, np.clongdouble,
]

scalars_set_code = [
Member

Could do with a comment like

# PEP3118 format strings for native (standard alignment and byteorder) types

{
_buffer_info_t *info = NULL;
PyArray_Descr *descr = NULL;
int elsize = 0;
Member

No point initializing any of these either, IMO

Member

@eric-wieser eric-wieser left a comment

Two nits on which my opinion might be worth ignoring, and a request for a comment in the test.


if (descr->byteorder == '=' &&
_is_natively_aligned_at(descr, arr, *offset)) {
if(PyArray_IsScalar(obj, Generic)) {
Member

super-nit: missing space after if

(np.half, 'e'),
(np.single, 'f'),
(np.double, 'd'),
(np.float_, 'd'),
Member

These two are aliases, so there's no point having both - just use np.double, and ditch np.float_

(np.single, 'f'),
(np.double, 'd'),
(np.float_, 'd'),
(np.longfloat, 'g'),
Member

This would normally be spelt np.longdouble

(np.longfloat, 'g'),
(np.csingle, 'Zf'),
(np.complex_, 'Zd'),
(np.clongfloat, 'Zg'),
Member

np.cdouble and np.clongdouble would be more consistent here.

]

# PEP3118 format strings for native (standard alignment and byteorder) types
scalars_set_code = [
Member

scalars_and_codes would be a better name

from numpy.testing import run_module_suite, assert_, assert_equal, dec

# types
scalars = [
Member

I think you can ditch this, and just use

for scalar, _ in scalars_and_codes:
   ...

in your tests below

@charris
Member

charris commented Mar 19, 2018

Updated.

assert_(isinstance(x, np.void))
mv_x = memoryview(x)
expected_size = 16 * np.dtype((np.unicode_, 1)).itemsize
expected_size += 2 * np.dtype((np.float64, 1)).itemsize
Member

Deliberately not using expected_size = dt.itemsize?

Contributor Author

Yes. I'd like to test against explicit values, rather than trusting that dt.itemsize is correct.
I do use the dtype.itemsize of unicode_ and float64 when calculating expected_size because of the platform-dependent sizes of unicode_. But this is still one step more explicit than testing scalar void.itemsize == dtype void.itemsize.
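The explicit calculation works out like this (np.str_ as the current spelling of np.unicode_; as a dtype, a unicode character is always 4 bytes):

```python
import numpy as np

dt = np.dtype([('name', np.str_, 16), ('grades', np.float64, (2,))])

# 16 unicode characters plus 2 float64s, built from component itemsizes
expected_size = 16 * np.dtype('U1').itemsize + 2 * np.dtype(np.float64).itemsize
assert expected_size == 16 * 4 + 2 * 8 == 80
assert dt.itemsize == expected_size
```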

Member

The assertion is on the memoryview-of-void itemsize, not the scalar void itemsize - arguably all we care about is that the itemsize of the memoryview and the dtype match.

I don't feel too strongly about this though.

Member

because of the platform-dependent sizes of unicode_

Note that unicode as a dtype is always 4-byte characters - you're thinking of unicode_ scalars, which are sometimes 2-byte, due to subclassing the builtin unicode.
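The dtype side of this is easy to confirm:

```python
import numpy as np

# As a dtype, unicode is always UCS4: 4 bytes per character on every platform
assert np.dtype('U1').itemsize == 4
assert np.dtype('U16').itemsize == 64
```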

@eric-wieser eric-wieser added this to the 1.15.0 release milestone Mar 20, 2018
@eric-wieser eric-wieser merged commit e4d678a into numpy:master Mar 25, 2018
@eric-wieser
Member

Thanks for a great first contribution @vanossj!
