ENH,BUG: Replace int with size_t in cub function calls#9810

Open
seberg wants to merge 5 commits into cupy:main from seberg:cub-relax-int

Conversation

@seberg
Member

@seberg seberg commented Mar 18, 2026

This replaces the `int` size-related parameters with `size_t` parameters in the CUB function calls, to remove the call-size limitation and fix an existing missing overflow check.

Commit created in part with cursor.


Closes gh-9780. Note: if there is any uncertainty here, or this seems nontrivial to review, then, as discussed elsewhere, we can just close this and I'll do the one-line fix from the issue instead. (The reasoning is that mid- to long-term we'll want a more comprehensive fix anyway.)
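For context, the failure mode behind the missing check is plain 32-bit truncation: any element count passed through a C `int` wraps once it reaches 2^31. A minimal illustration of the arithmetic (pure Python, emulating the C conversion; this is not CuPy's actual wrapper code):

```python
INT_MAX = 2**31 - 1

def as_c_int(num_items):
    """Emulate passing a Python size through a C ``int`` (two's complement)."""
    wrapped = num_items & 0xFFFFFFFF
    return wrapped - 2**32 if wrapped > INT_MAX else wrapped

# Sizes that fit in ``int`` survive the round trip unchanged.
assert as_c_int(10**6) == 10**6
assert as_c_int(INT_MAX) == INT_MAX

# One past the boundary silently becomes a large negative count --
# the kind of value a missing overflow check lets through to CUB.
assert as_c_int(INT_MAX + 1) == -(2**31)
```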

@seberg seberg requested a review from a team as a code owner March 18, 2026 14:40
@seberg
Member Author

seberg commented Mar 18, 2026

Never mind... I added a test, and after improving it (the agent version wasn't very thorough) it now fails, and I noticed that argmin/argmax probably also need fixing.
So I think I'll go with the minimal version for now (although... hopefully the new tests pass there).

@seberg seberg closed this Mar 18, 2026
@seberg
Member Author

seberg commented Mar 18, 2026

OK, the issue was in the test itself, so I'll give this one more chance and just reject argmin/argmax for those paths.
The tests mostly won't run in CI -- they simply need too much memory. But they are passing locally (they are pretty slow, though).

@seberg seberg reopened this Mar 18, 2026
@seberg
Member Author

seberg commented Mar 18, 2026

/test rocm

Member

@leofang leofang left a comment


We need a performance study, because CUB algorithms are extremely sensitive to the number of items (which is partly why it took so long for large-size support to land in CUB). In the worst-case scenario we would need to dispatch to two different kernel instantiations and favor `int` over `size_t` unless `num_items` is too large, though that would inflate the compile time and binary size, which we also need to study.
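The dual-dispatch fallback described above could look something like this (a schematic in Python, not CuPy's actual code; the kernel names are placeholders):

```python
INT_MAX = 2**31 - 1

def dispatch_reduce(num_items):
    """Pick the index type for a reduction launch.

    Favors the ``int`` instantiation (the one CUB's tunings target)
    and falls back to ``size_t`` only when the size demands it.
    """
    if num_items <= INT_MAX:
        return "reduce_kernel<int>"    # placeholder name
    return "reduce_kernel<size_t>"     # placeholder name

assert dispatch_reduce(10**6) == "reduce_kernel<int>"
assert dispatch_reduce(2**31) == "reduce_kernel<size_t>"
```

The cost is that both instantiations must be compiled, roughly doubling that kernel's share of compile time and binary size -- the trade-off flagged above.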

@leofang leofang self-assigned this Mar 18, 2026
@seberg
Member Author

seberg commented Mar 19, 2026

Hmmm, good point. Locally, for summing it seems to be identical (unless it is hardware dependent -- for arange I guess there is a bigger chance of a difference, but it is unchanged for now). I'll try to confirm that CUB picks the best implementation at runtime.

Actual timings in details (first run is new branch, but it is identical either way -- confirmed on T400 and RTX PRO6000):

Details
In [3]: import cupy as cp

In [4]: a = cp.arange(1000000.)

In [5]: %timeit a.sum()
10.2 μs ± 0.455 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [6]: %timeit a.reshape(1000, 1000).sum(axis=0)
12.3 μs ± 0.544 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [7]: %timeit a.reshape(1000, 1000).sum(axis=1)
8.19 μs ± 0.514 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [8]: %timeit a.sum()
10.2 μs ± 0.44 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [9]: %timeit a.reshape(1000, 1000).sum(axis=0)
12.3 μs ± 0.527 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [10]: %timeit a.reshape(1000, 1000).sum(axis=1)
8.19 μs ± 0.434 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [11]:                                                                                                                          
Do you really want to exit ([y]/n)? 
(cupy-dev) sebastianb@sebastianb-workstation:~/forks/cupy/tests$ ipython
Python 3.14.3 | packaged by conda-forge | (main, Feb  9 2026, 21:56:02) [GCC 14.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 9.9.0 -- An enhanced Interactive Python. Type '?' for help.
Tip: You can use `%hist` to view history, see the options with `%history?`

In [1]: import cupy as cp

In [2]: a = cp.arange(1000000.)

In [3]: %timeit a.sum()
10.2 μs ± 0.453 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [4]: %timeit a.reshape(1000, 1000).sum(axis=0)
12.3 μs ± 0.535 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [5]: %timeit a.reshape(1000, 1000).sum(axis=1)
8.2 μs ± 0.541 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

It seems to me that larger index support is implemented by launching int-sized kernels repeatedly for the larger sizes, rather than by compiling two kernel variants:
https://github.com/NVIDIA/cccl/pull/3764/changes#diff-7c890d0fb2241798e5e82f69cfc78e3308e2830ebcc13f5cec23f0c0a4785d3eR831
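If I read the linked change correctly, the idea is to cover a 64-bit item count with repeated int-sized launches instead of a second kernel type. Roughly (a schematic only, not the actual CCCL code):

```python
INT_MAX = 2**31 - 1

def launch_plan(num_items, max_per_launch=INT_MAX):
    """Split a 64-bit item count into int-sized (offset, count) launches."""
    plan = []
    start = 0
    while start < num_items:
        count = min(num_items - start, max_per_launch)
        plan.append((start, count))
        start += count
    return plan

# Small sizes keep a single launch, so the int fast path is untouched.
assert launch_plan(1000) == [(0, 1000)]
# Past the boundary, the work is covered by multiple int-sized launches.
assert launch_plan(2**31 + 5) == [(0, INT_MAX), (INT_MAX, 6)]
```

This would explain why the timings above are unchanged: below 2^31 items, the launch sequence is the same as before.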

Member

@leofang leofang left a comment


Approving since I will get disconnected soon and I don't wanna block this from merging 😛 but before merging it'd be better to check for perf regressions on data center GPUs like A100, H100, and B100, because that's where CUB tuning really shines.

@leofang leofang added cat:bug Bugs to-be-backported Pull-requests to be backported to stable branch labels Mar 28, 2026
@leofang
Member

leofang commented Mar 28, 2026

btw when testing the perf we should sweep through input sizes in log scale (10^2, 10^3, ...) because CUB performs differently in different regions.
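A sweep harness along those lines might look like this (a sketch; on a real run, `bench` would wrap the CuPy call plus a device synchronization, and the exponent range would extend further):

```python
import timeit

def sweep(bench, exponents=range(2, 9)):
    """Time ``bench(n)`` for n = 10**2, 10**3, ... (log-scale sweep)."""
    results = {}
    for k in exponents:
        n = 10**k
        # Best of 3 repeats, averaged over 10 calls each.
        t = min(timeit.repeat(lambda: bench(n), number=10, repeat=3)) / 10
        results[n] = t
    return results

# Trivial stand-in workload (ignores n) so the harness runs anywhere;
# on a GPU box this would be e.g. lambda n: arrays[n].sum() + a sync.
results = sweep(lambda n: sum(range(100)), exponents=range(2, 5))
assert sorted(results) == [100, 1000, 10000]
```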

eriknw added a commit to eriknw/cupy that referenced this pull request Mar 30, 2026
Enable native cuSPARSE SpGEMM for int64 on CUDA 13.0+. The
cuSPARSE version (12709) is the same for CUDA 12.7 and 13.0, so
dispatch checks the CUDA runtime version instead. Result matrices
use _from_parts to preserve int64 index dtype.

Template dense2csr kernels for int64 (B3): _d2c_atomic_add helper
for int64 atomicAdd, remove ValueError guard for large shapes.

Add _cumsum_int64 workaround for cupy.cumsum silent failure on
arrays >= 2^32 elements (CUB scan wrapper int overflow; fix in
PR cupy#9810). Used in _build_indptr and dense2csr.

Fix COO transpose falsely propagating has_canonical_format (caused
54 CI failures in LSMR/SVDs). Fix bmat crash on mixed sparse+dense
blocks. Fix _maximum_minimum, setdiag, _major_slice losing int64
dtype or flags. Move int64 routing before check_availability in
CSR/CSC conversion functions.

Convert 37 bare asserts to ValueError/TypeError in cusparse.py.
Add early-switch heuristic in random() based on allocation size.
Add 17 regression tests.
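The `_cumsum_int64` workaround mentioned above suggests the general pattern of splitting a scan that is only safe up to some chunk size. A plain-Python illustration of that shape (hypothetical helper, not the actual `_cumsum_int64` code): scan each chunk independently and add the carry from everything before it.

```python
def chunked_cumsum(values, chunk_size):
    """Cumulative sum computed in chunks with a running carry.

    Illustrative only: each chunk stays below ``chunk_size`` elements,
    the way a wrapper could keep every underlying scan within a safe
    item-count limit.
    """
    out = []
    carry = 0
    for start in range(0, len(values), chunk_size):
        acc = carry
        for v in values[start:start + chunk_size]:
            acc += v
            out.append(acc)
        carry = acc
    return out

data = [1, 2, 3, 4, 5, 6, 7]
assert chunked_cumsum(data, chunk_size=3) == [1, 3, 6, 10, 15, 21, 28]
```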
Successfully merging this pull request may close these issues.

Illegal memory access in cupy.sum (cupy 13.6.0)