-
-
Notifications
You must be signed in to change notification settings - Fork 11.8k
BUG: Fix np.strings.slice if stop=None or start and stop >= len #29944
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
db9748c to
3998c3b
Compare
seberg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am slightly curious if the second branch is correct, but I think it should be since it iterates through one by one so that the iteration stopping criteria encodes the incorrect logic.
Thanks for looking into both issues! I wonder if the None change should be mentioned, but overall it seems clear enough (and this function is still fairly new also anyway).
Python treats `slice(-1)` differently from `slice(-1, None)`:
The first is interpreted as `slice(None, -1, None)`, while the second
becomes `slice(-1, None, None)`, according to the logic in `slice_new`.
However, `np.strings.slice` treats these identically, as it cannot
distinguish unset arguments from arguments set to None. This makes it
impossible to get the last characters of each string, for example:
```python
>>> a = np.array(['hello', 'world'])
>>> np.strings.slice(a, -2, None) # should return last two characters
array(['hel', 'wor'], dtype='<U5')
```
This commit fixes that behavior:
```python
>>> a = np.array(['hello', 'world'])
>>> np.strings.slice(a, -2, None) # returns last characters as expected
array(['lo', 'ld'], dtype='<U5')
>>> np.strings.slice(a, -2) # original behavior preserved if no stop
array(['hel', 'wor'], dtype='<U5')
```
It does this by adding a `stop=np._NoValue` default argument to
`np.strings.slice`, which can be overridden with `None`.
This commit also adds test conditions to
`numpy/_core/tests/test_strings.py::TestMethods::test_slice`
to verify that the slicing behavior matches Python's `slice`.
Note that 4 newly added test conditions are commented out for now,
as they cause errors with the "T" dtype. To reproduce:
```
>>> np.__version__
'2.3.3'
>>> a = np.array(['hello', 'world'], dtype="T")
>>> np.strings.slice(a, 5, 7)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "numpy-dev/lib/python3.12/site-packages/numpy/_core/strings.py", line 1823, in slice
return _slice(a, start, stop, step)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
MemoryError: Failed to allocate string in slice
```
This causes either a MemoryError or kills the process with code 251.
Allows commented test_slice conditions to be uncommented.
3998c3b to
7b9c5fc
Compare
seberg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @aaronkollasch. I think this is good to go in if tests pass (I assume they will).
|
Sounds great, thanks for reviewing @seberg! |
…y#29944) Python treats `slice(-1)` differently from `slice(-1, None)`: The first is interpreted as `slice(None, -1, None)`, while the second becomes `slice(-1, None, None)`, according to the logic in `slice_new`. However, `np.strings.slice` treats these identically, as it cannot distinguish unset arguments from arguments set to None. This makes it impossible to get the last characters of each string, for example: ```python >>> a = np.array(['hello', 'world']) >>> np.strings.slice(a, -2, None) # should return last two characters array(['hel', 'wor'], dtype='<U5') ``` This commit fixes that behavior: ```python >>> a = np.array(['hello', 'world']) >>> np.strings.slice(a, -2, None) # returns last characters as expected array(['lo', 'ld'], dtype='<U5') >>> np.strings.slice(a, -2) # original behavior preserved if no stop array(['hel', 'wor'], dtype='<U5') ``` It does this by adding a `stop=np._NoValue` default argument to `np.strings.slice`, which can be overridden with `None`. This commit also adds test conditions to `numpy/_core/tests/test_strings.py::TestMethods::test_slice` to verify that the slicing behavior matches Python's `slice`. Note that 4 newly added test conditions are commented out for now, as they cause errors with the "T" dtype. To reproduce: ``` >>> np.__version__ '2.3.3' >>> a = np.array(['hello', 'world'], dtype="T") >>> np.strings.slice(a, 5, 7) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "numpy-dev/lib/python3.12/site-packages/numpy/_core/strings.py", line 1823, in slice return _slice(a, start, stop, step) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ MemoryError: Failed to allocate string in slice ``` This causes either a MemoryError or kills the process with code 251. * BUG: Fix np.strings.slice when start and stop >= len Allows commented test_slice conditions to be uncommented.
|
Thanks so much for the fix and for expanding the tests! |
BUG: Fix np.strings.slice if stop=None or start and stop >= len (#29944)
…y#29944) Python treats `slice(-1)` differently from `slice(-1, None)`: The first is interpreted as `slice(None, -1, None)`, while the second becomes `slice(-1, None, None)`, according to the logic in `slice_new`. However, `np.strings.slice` treats these identically, as it cannot distinguish unset arguments from arguments set to None. This makes it impossible to get the last characters of each string, for example: ```python >>> a = np.array(['hello', 'world']) >>> np.strings.slice(a, -2, None) # should return last two characters array(['hel', 'wor'], dtype='<U5') ``` This commit fixes that behavior: ```python >>> a = np.array(['hello', 'world']) >>> np.strings.slice(a, -2, None) # returns last characters as expected array(['lo', 'ld'], dtype='<U5') >>> np.strings.slice(a, -2) # original behavior preserved if no stop array(['hel', 'wor'], dtype='<U5') ``` It does this by adding a `stop=np._NoValue` default argument to `np.strings.slice`, which can be overridden with `None`. This commit also adds test conditions to `numpy/_core/tests/test_strings.py::TestMethods::test_slice` to verify that the slicing behavior matches Python's `slice`. Note that 4 newly added test conditions are commented out for now, as they cause errors with the "T" dtype. To reproduce: ``` >>> np.__version__ '2.3.3' >>> a = np.array(['hello', 'world'], dtype="T") >>> np.strings.slice(a, 5, 7) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "numpy-dev/lib/python3.12/site-packages/numpy/_core/strings.py", line 1823, in slice return _slice(a, start, stop, step) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ MemoryError: Failed to allocate string in slice ``` This causes either a MemoryError or kills the process with code 251. * BUG: Fix np.strings.slice when start and stop >= len Allows commented test_slice conditions to be uncommented.
Python treats
slice(-1)differently fromslice(-1, None): The first is interpreted asslice(None, -1, None), while the second becomesslice(-1, None, None), according to the logic inslice_new.However,
np.strings.slicetreats these identically, as it cannot distinguish unset arguments from arguments set to None. This makes it impossible to get the last characters of each string, for example:This PR fixes that behavior:
It does this by adding a
stop=np._NoValuedefault argument tonp.strings.slice, which can be overridden withNone.The PR adds test conditions to
numpy/_core/tests/test_strings.py::TestMethods::test_slice, to verify that the slicing behavior matches Python'sslicewith these arguments.It also fixes an error when start and stop are >= the string length and the dtype is
StringDType(). To reproduce:When running
spin ipythonbased on main, these commands cause the python process to exit with code 251.To fix this, I added bounds checks to
codepoint_offsets[start]innumpy/_core/src/umath/stringdtype_ufuncs.cpp.