ENH,BUG: Replace int with size_t in cub function calls#9810

Open
seberg wants to merge 5 commits into cupy:main from seberg:cub-relax-int

Conversation

@seberg
Member

@seberg seberg commented Mar 18, 2026

This replaces the `int` size-related parameters with `size_t` parameters in the CUB function calls, to remove the call-size limitation and fix an existing missing overflow check.

Commit created in part with cursor.


Closes gh-9780. Note: if there is any uncertainty here, or this seems nontrivial to review, then, as discussed elsewhere, we can just close this and I'll do the one-line fix from the issue instead. (The reasoning is that mid- to long-term we'll want a more comprehensive fix anyway.)
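For context, the failure mode behind the missing check is plain 32-bit truncation: any element count passed through a C `int` wraps once it reaches 2^31. A minimal illustration of the arithmetic (pure Python, emulating the C conversion; this is not CuPy's actual wrapper code):

```python
INT_MAX = 2**31 - 1

def as_c_int(num_items):
    """Emulate passing a Python size through a C ``int`` (two's complement)."""
    wrapped = num_items & 0xFFFFFFFF
    return wrapped - 2**32 if wrapped > INT_MAX else wrapped

# Sizes that fit in ``int`` survive the round trip unchanged.
assert as_c_int(10**6) == 10**6
assert as_c_int(INT_MAX) == INT_MAX

# One past the boundary silently becomes a large negative count --
# the kind of value a missing overflow check lets through to CUB.
assert as_c_int(INT_MAX + 1) == -(2**31)
```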

@seberg seberg requested a review from a team as a code owner March 18, 2026 14:40
@seberg
Member Author

seberg commented Mar 18, 2026

Never mind... I added a test, and after improving it (the agent version wasn't very thorough) it now fails, and I noticed that argmin/argmax probably also need fixing.
So I think I'll go with the minimal version for now (although... hopefully the new tests pass there).

@seberg seberg closed this Mar 18, 2026
@seberg
Member Author

seberg commented Mar 18, 2026

OK, the issue was in the test itself, so I'll give this one more chance and just reject argmin/argmax for those paths.
The tests mostly won't run in CI -- they simply need too much memory. But they are passing locally (they are pretty slow, though).

@seberg seberg reopened this Mar 18, 2026
@seberg
Member Author

seberg commented Mar 18, 2026

/test rocm

Member

@leofang leofang left a comment


We need a performance study, because CUB algorithms are extremely sensitive to the number of items (which is partly why it took so long for large-size support to land in CUB). In the worst-case scenario we would need to dispatch to two different kernel instantiations and favor `int` over `size_t` unless `num_items` is too large, though that would inflate the compile time and binary size, which we also need to study.
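The dual-dispatch fallback described above could look something like this (a schematic in Python, not CuPy's actual code; the kernel names are placeholders):

```python
INT_MAX = 2**31 - 1

def dispatch_reduce(num_items):
    """Pick the index type for a reduction launch.

    Favors the ``int`` instantiation (the one CUB's tunings target)
    and falls back to ``size_t`` only when the size demands it.
    """
    if num_items <= INT_MAX:
        return "reduce_kernel<int>"    # placeholder name
    return "reduce_kernel<size_t>"     # placeholder name

assert dispatch_reduce(10**6) == "reduce_kernel<int>"
assert dispatch_reduce(2**31) == "reduce_kernel<size_t>"
```

The cost is that both instantiations must be compiled, roughly doubling that kernel's share of compile time and binary size -- the trade-off flagged above.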

@leofang leofang self-assigned this Mar 18, 2026
@seberg
Member Author

seberg commented Mar 19, 2026

Hmmm, good point. Locally, for summing it seems to be identical (unless it is hardware dependent -- for arange I guess there is a bigger chance of a difference, but it is unchanged for now). I'll try to confirm that CUB picks the best implementation at runtime.

Actual timings in details (first run is new branch, but it is identical either way -- confirmed on T400 and RTX PRO6000):

Details
In [3]: import cupy as cp

In [4]: a = cp.arange(1000000.)

In [5]: %timeit a.sum()
10.2 μs ± 0.455 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [6]: %timeit a.reshape(1000, 1000).sum(axis=0)
12.3 μs ± 0.544 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [7]: %timeit a.reshape(1000, 1000).sum(axis=1)
8.19 μs ± 0.514 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [8]: %timeit a.sum()
10.2 μs ± 0.44 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [9]: %timeit a.reshape(1000, 1000).sum(axis=0)
12.3 μs ± 0.527 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [10]: %timeit a.reshape(1000, 1000).sum(axis=1)
8.19 μs ± 0.434 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [11]:                                                                                                                          
Do you really want to exit ([y]/n)? 
(cupy-dev) sebastianb@sebastianb-workstation:~/forks/cupy/tests$ ipython
Python 3.14.3 | packaged by conda-forge | (main, Feb  9 2026, 21:56:02) [GCC 14.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 9.9.0 -- An enhanced Interactive Python. Type '?' for help.
Tip: You can use `%hist` to view history, see the options with `%history?`

In [1]: import cupy as cp

In [2]: a = cp.arange(1000000.)

In [3]: %timeit a.sum()
10.2 μs ± 0.453 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [4]: %timeit a.reshape(1000, 1000).sum(axis=0)
12.3 μs ± 0.535 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [5]: %timeit a.reshape(1000, 1000).sum(axis=1)
8.2 μs ± 0.541 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

It seems to me that larger index support is implemented by launching int-sized kernels repeatedly for the larger sizes, rather than by compiling two kernel variants:
https://github.com/NVIDIA/cccl/pull/3764/changes#diff-7c890d0fb2241798e5e82f69cfc78e3308e2830ebcc13f5cec23f0c0a4785d3eR831
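If I read the linked change correctly, the idea is to cover a 64-bit item count with repeated int-sized launches instead of a second kernel type. Roughly (a schematic only, not the actual CCCL code):

```python
INT_MAX = 2**31 - 1

def launch_plan(num_items, max_per_launch=INT_MAX):
    """Split a 64-bit item count into int-sized (offset, count) launches."""
    plan = []
    start = 0
    while start < num_items:
        count = min(num_items - start, max_per_launch)
        plan.append((start, count))
        start += count
    return plan

# Small sizes keep a single launch, so the int fast path is untouched.
assert launch_plan(1000) == [(0, 1000)]
# Past the boundary, the work is covered by multiple int-sized launches.
assert launch_plan(2**31 + 5) == [(0, INT_MAX), (INT_MAX, 6)]
```

This would explain why the timings above are unchanged: below 2^31 items, the launch sequence is the same as before.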

Member

@leofang leofang left a comment


Approving since I will get disconnected soon and I don't wanna block this from merging 😛 but before merging it'd be better to check for perf regressions on data center GPUs like A100, H100, and B100, because that's where CUB tuning really shines.

@leofang leofang added cat:bug Bugs to-be-backported Pull-requests to be backported to stable branch labels Mar 28, 2026
@leofang
Member

leofang commented Mar 28, 2026

btw when testing the perf we should sweep through input sizes in log scale (10^2, 10^3, ...) because CUB performs differently in different regions.
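A sweep harness along those lines might look like this (a sketch; on a real run, `bench` would wrap the CuPy call plus a device synchronization, and the exponent range would extend further):

```python
import timeit

def sweep(bench, exponents=range(2, 9)):
    """Time ``bench(n)`` for n = 10**2, 10**3, ... (log-scale sweep)."""
    results = {}
    for k in exponents:
        n = 10**k
        # Best of 3 repeats, averaged over 10 calls each.
        t = min(timeit.repeat(lambda: bench(n), number=10, repeat=3)) / 10
        results[n] = t
    return results

# Trivial stand-in workload (ignores n) so the harness runs anywhere;
# on a GPU box this would be e.g. lambda n: arrays[n].sum() + a sync.
results = sweep(lambda n: sum(range(100)), exponents=range(2, 5))
assert sorted(results) == [100, 1000, 10000]
```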

eriknw added a commit to eriknw/cupy that referenced this pull request Mar 30, 2026
Enable native cuSPARSE SpGEMM for int64 on CUDA 13.0+. The
cuSPARSE version (12709) is the same for CUDA 12.7 and 13.0, so
dispatch checks the CUDA runtime version instead. Result matrices
use _from_parts to preserve int64 index dtype.

Template dense2csr kernels for int64 (B3): _d2c_atomic_add helper
for int64 atomicAdd, remove ValueError guard for large shapes.

Add _cumsum_int64 workaround for cupy.cumsum silent failure on
arrays >= 2^32 elements (CUB scan wrapper int overflow; fix in
PR cupy#9810). Used in _build_indptr and dense2csr.

Fix COO transpose falsely propagating has_canonical_format (caused
54 CI failures in LSMR/SVDs). Fix bmat crash on mixed sparse+dense
blocks. Fix _maximum_minimum, setdiag, _major_slice losing int64
dtype or flags. Move int64 routing before check_availability in
CSR/CSC conversion functions.

Convert 37 bare asserts to ValueError/TypeError in cusparse.py.
Add early-switch heuristic in random() based on allocation size.
Add 17 regression tests.
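The `_cumsum_int64` workaround mentioned above suggests the general pattern of splitting a scan that is only safe up to some chunk size. A plain-Python illustration of that shape (hypothetical helper, not the actual `_cumsum_int64` code): scan each chunk independently and add the carry from everything before it.

```python
def chunked_cumsum(values, chunk_size):
    """Cumulative sum computed in chunks with a running carry.

    Illustrative only: each chunk stays below ``chunk_size`` elements,
    the way a wrapper could keep every underlying scan within a safe
    item-count limit.
    """
    out = []
    carry = 0
    for start in range(0, len(values), chunk_size):
        acc = carry
        for v in values[start:start + chunk_size]:
            acc += v
            out.append(acc)
        carry = acc
    return out

data = [1, 2, 3, 4, 5, 6, 7]
assert chunked_cumsum(data, chunk_size=3) == [1, 3, 6, 10, 15, 21, 28]
```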
Successfully merging this pull request may close these issues.

Illegal memory access in cupy.sum (cupy 13.6.0)