Allow cupy.ndarray as repeats argument to cupy.repeat #9828
Conversation
cupy.repeat now accepts a CuPy ndarray for the repeats parameter, matching NumPy's API. Scalar-like inputs (size-1 or 0-D arrays) use an efficient broadcast-copy kernel. Multi-element inputs use cumsum + searchsorted + take on the GPU. List/tuple inputs with len > 1, which previously cost one kernel launch per element, now use the same GPU path. numpy.ndarray inputs are rejected with a TypeError (CuPy convention). Closes cupy#3849.
seberg
left a comment
Looks good in general, I have a lot of nitpicking comments, some can be ignored and/or are more general "could be a bit nicer". As usual, tests seem a bit verbose, but I don't mind much.
(I should look through it once more, but thought I'd post.)
```python
    faster than scatter-add + cumsum because CUDA atomic operations have
    high per-operation overhead relative to parallel binary search.
    """
    import cupy as _cupy
```
Seems a bit strange to do it here. Might as well do it top-level?
```python
    if not numpy.issubdtype(reps.dtype, numpy.integer):
        raise ValueError(
            "'repeats' array must have integer dtype, got {}".format(
                reps.dtype))
```
I might suggest just using `reps = reps.astype(np.intp, casting="safe")`.
(`"safe"` is what NumPy seems to use, but `"same_kind"` actually makes more sense maybe, since it allows casting from uint64 even if shady in theory.)
That replaces the `!= intp` check below and raises the error here as well.
But looking below, it also feels like defensive nonsense to do this here. At this point it should already be guaranteed to be a 1-D intp array.
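A quick NumPy sketch of that `astype` suggestion (illustrative values, not from the PR), showing why a single cast can replace the manual dtype check:

```python
import numpy as np

# "safe" casting accepts any integer dtype that fits into intp...
reps = np.array([1, 2, 3], dtype=np.int32).astype(np.intp, casting="safe")

# ...and raises TypeError for float dtypes, so the manual
# issubdtype check and its hand-written error message go away.
try:
    np.array([1.5, 2.0]).astype(np.intp, casting="safe")
except TypeError:
    print("float repeats rejected")

# uint64 -> intp is not "safe" (large values may not fit), which is
# why "same_kind" may make more sense despite being shadier in theory.
assert np.can_cast(np.uint64, np.intp, casting="same_kind")
assert not np.can_cast(np.uint64, np.intp, casting="safe")
```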
```python
            "'repeats' array must have integer dtype, got {}".format(
                reps.dtype))

    reps = reps.ravel()
```
I don't think NumPy does this ravel? A bit unclear why it is here.
```python
    axis = internal._normalize_axis_index(axis, a.ndim)
```
Suggested change:

```diff
-    axis = internal._normalize_axis_index(axis, a.ndim)
+    else:
+        axis = internal._normalize_axis_index(axis, a.ndim)
```
```python
    if reps_cpu.min() < 0:
        raise ValueError(
            "all elements of 'repeats' should not be negative")
    total = int(reps_cpu.sum())
```
Not sure I understand why we would do this on the CPU rather than the GPU if we already made it a GPU array?
I mean, sure, it may be pretty fast, but we do a cumsum below on the GPU, so it seems like a weird mix of things.
Also, `boundaries[-1]` is already the total sum, so we don't have to compute it twice.
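For reference, here is a NumPy sketch of the cumsum + searchsorted + take path the PR describes (the name `boundaries` comes from the review comment; the data is illustrative). It also shows that the last cumsum element already gives the output length:

```python
import numpy as np

a = np.array([10, 20, 30])
reps = np.array([2, 0, 3])

# Cumulative boundaries of each element's output range.
boundaries = np.cumsum(reps)           # [2, 2, 5]
# The last boundary is already the total output length, so a
# separate reps.sum() would be redundant.
total = int(boundaries[-1])            # 5
# For each output position, binary-search which source element owns it.
idx = np.searchsorted(boundaries, np.arange(total), side='right')
out = np.take(a, idx)                  # [10, 10, 30, 30, 30]
```

On the GPU the same three calls run as parallel kernels, which is the whole point of replacing the per-element launch loop.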
```python
    elif isinstance(repeats, int) or (
            hasattr(repeats, 'dtype') and
            numpy.ndim(repeats) == 0 and
            numpy.issubdtype(repeats.dtype, numpy.integer)):
```
I would suggest just using:

```python
try:
    repeats = operator.index(repeats)
except TypeError:
    pass
```

Admittedly, that is a bit broader, but not by much, so skip it if you don't like it.
(I don't love try/excepts, but things like `numpy.ndim` and `numpy.issubdtype` are also all shady and slow.)
This would let numpy ndarrays pass, maybe...
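A quick sketch of what `operator.index` accepts and rejects (the `normalize_repeats` wrapper is hypothetical, purely for illustration):

```python
import operator
import numpy as np

def normalize_repeats(repeats):
    # Hypothetical wrapper: return a Python int when repeats is
    # integer-like, otherwise None (fall through to the array path).
    try:
        return operator.index(repeats)
    except TypeError:
        return None

print(normalize_repeats(3))            # plain int -> 3
print(normalize_repeats(np.int64(3)))  # NumPy integer scalar -> 3
print(normalize_repeats(np.array(3)))  # 0-D int ndarray also passes -> 3
print(normalize_repeats(3.0))          # floats are rejected -> None
print(normalize_repeats([1, 2]))       # sequences are rejected -> None
```

Note the last two `print`s confirm the "bit broader" caveat: 0-D integer ndarrays implement `__index__`, so they would slip through this check.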
```python
        axis = 0
    # Non-broadcast sequence (including empty): GPU per-element path.
    return _repeat_ndarray_repeats(
        a, core.array(repeats, dtype=numpy.intp), axis)
```
Using `core.array(repeats, dtype=numpy.intp)` means we actually accept a list of floats indirectly.
This is maybe fine; not doing so is actually a bit tedious (you need to special-case `len(repeats) == 0`, and otherwise use `numpy.asarray()` and then do a same_kind cast).
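The tedious stricter path described here might look like this (a hypothetical `as_intp_repeats` helper, sketched with NumPy):

```python
import numpy as np

def as_intp_repeats(repeats):
    # Hypothetical helper: convert a sequence to an intp array while
    # rejecting floats, unlike a plain array(..., dtype=intp) call,
    # which would silently truncate them.
    if len(repeats) == 0:
        # Empty sequences have no element dtype to inspect; special-case.
        return np.empty(0, dtype=np.intp)
    reps = np.asarray(repeats)
    # "same_kind" allows integer-to-integer casts (including the shady
    # uint64 -> intp one) but raises TypeError for float inputs.
    return reps.astype(np.intp, casting="same_kind")
```

For comparison, `np.array([1.5], dtype=np.intp)` quietly produces `[1]`, while `as_intp_repeats([1.5])` raises TypeError.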
```python
    """Make a reps array on the right platform."""
    if xp is cupy:
        return cupy.array(vals)
    return numpy.array(vals)
```
This function is just silly :), since it is just `xp.array(vals)`.
```python
    {'shape': (2, 3, 4, 5), 'reps': [2, 1, 3], 'axis': 1},
    {'shape': (2, 3, 4, 5), 'reps': [3], 'axis': 3},
)
class TestRepeatNdarrayRepeats(unittest.TestCase):
```
Let's get rid of unittest and use a normal `@pytest.mark.parametrize` (all it does is pass `shape` as an argument rather than `self.shape`).
Just as a long-term hope to move toward a more typical pattern in general.
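A sketch of what the parametrized version might look like (the test body is illustrative and uses NumPy so it runs without a GPU; the real test would call `cupy.repeat`):

```python
import numpy as np
import pytest

@pytest.mark.parametrize('shape, reps, axis', [
    ((2, 3, 4, 5), [2, 1, 3], 1),
    ((2, 3, 4, 5), [3], 3),
])
def test_repeat_ndarray_repeats(shape, reps, axis):
    # Parameters arrive as plain function arguments, not as
    # self.shape / self.reps / self.axis attributes on a TestCase.
    a = np.arange(np.prod(shape)).reshape(shape)
    out = np.repeat(a, reps, axis=axis)
    # A len-1 reps list is broadcast across the whole axis.
    expected_len = sum(reps) * (shape[axis] // len(reps))
    assert out.shape[axis] == expected_len
```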
```python
    """Make an integer reps array (int32 dtype), even for empty lists."""
    if xp is cupy:
        return cupy.array(vals, dtype=numpy.int32)
    return numpy.array(vals, dtype=numpy.int32)
```
Also feels like an unnecessary helper?
Thanks for the initial review and nits -- keep 'em coming! They seem valid to me. I'll give my human replies on Monday.
Closes #3849. There are probably good reasons why this addition hasn't been done before (there is a D2H transfer), given how long #3849 has been open and how much interest there has been in it, so I'm curious to find out why! This feature has actually been on my wishlist for a few years too.
Claude Code (Opus 4.6) was used to assist in the development of this PR, but I was also heavily involved. These changes were actually spawned from my efforts to improve sparse matrix classes in #9825, and like in that PR I'm still doing my final human review of these changes.