ENH,BUG: Replace int with size_t in cub function calls #9810
This replaces the `int` size-related parameters with `size_t` parameters in CUB function calls to remove the call-size limitation and add a previously missing check. Commit created in part with Cursor.
Nvm... I added a test, and after improving it (the agent version wasn't very thorough) it now just fails, and I noticed that
OK, the issue was in the test itself, so I'll give it one more chance and reject argmin/argmax for those paths.
/test rocm
We need a performance study because CUB algorithms are extremely sensitive to the number of items (which is partly why it took a long time for large-size support to land in CUB). In the worst case, we'd need to dispatch to two different kernel implementations and favor `int` over `size_t` unless `num_items` is too large, though that would inflate the compile time and binary size, which we also need to study.
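The dual-dispatch idea described above could look roughly like the following sketch (all names here are hypothetical illustrations, not CUB's actual API; in real code this would be two template instantiations of the same kernel with different offset types):

```python
# Sketch of offset-type dispatch: favor a 32-bit offset kernel unless
# num_items exceeds what a signed int can address. Hypothetical names.
INT32_MAX = 2**31 - 1

def dispatch_reduce(num_items, reduce_int32, reduce_int64):
    """Pick the int-offset kernel when possible; fall back to size_t."""
    if num_items <= INT32_MAX:
        return reduce_int32(num_items)   # cheaper offset arithmetic
    return reduce_int64(num_items)       # required for > 2**31 - 1 items

# Example: record which path each call takes.
calls = []
dispatch_reduce(10**6, lambda n: calls.append("i32"),
                lambda n: calls.append("i64"))
dispatch_reduce(2**32, lambda n: calls.append("i32"),
                lambda n: calls.append("i64"))
# calls == ["i32", "i64"]
```

The trade-off noted in the comment is that both instantiations must be compiled, which grows binary size even though only one runs per call.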
Hmmm, good point. Locally, for summing it seems to be identical (unless it is hardware dependent; for arange I guess there is a bigger chance of a difference, but it is unchanged for now). I'll try to confirm that CUB picks the best implementation at runtime. Actual timings in the details below (first run is the new branch, but it is identical either way; confirmed on a T400 and an RTX PRO 6000). It seems to me that larger indices are handled by launching smaller kernels for the larger sizes, rather than by two kernel types:
leofang left a comment:
Approving since I will get disconnected soon and I don't wanna block this from merging 😛 but before merging it'd be better to check for perf regressions on data center GPUs like A100, H100, and B100, because that's where CUB tuning really shines.
btw when testing the perf we should sweep through input sizes in log scale (10^2, 10^3, ...) because CUB performs differently in different regions.
Enable native cuSPARSE SpGEMM for int64 on CUDA 13.0+. The cuSPARSE version (12709) is the same for CUDA 12.7 and 13.0, so dispatch checks the CUDA runtime version instead.
- Result matrices use `_from_parts` to preserve int64 index dtype.
- Template dense2csr kernels for int64 (B3): `_d2c_atomic_add` helper for int64 atomicAdd; remove ValueError guard for large shapes.
- Add `_cumsum_int64` workaround for `cupy.cumsum` silently failing on arrays >= 2^32 elements (int overflow in the CUB scan wrapper; fix in PR cupy#9810). Used in `_build_indptr` and dense2csr.
- Fix COO transpose falsely propagating `has_canonical_format` (caused 54 CI failures in LSMR/SVDs).
- Fix `bmat` crash on mixed sparse+dense blocks.
- Fix `_maximum_minimum`, `setdiag`, `_major_slice` losing int64 dtype or flags.
- Move int64 routing before `check_availability` in CSR/CSC conversion functions.
- Convert 37 bare asserts to ValueError/TypeError in `cusparse.py`.
- Add early-switch heuristic in `random()` based on allocation size.
- Add 17 regression tests.
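A workaround like the `_cumsum_int64` one mentioned above can be sketched as a chunked scan that never hands the underlying kernel more than `INT_MAX` items, carrying the running total between chunks. This is only an illustration of the technique, not CuPy's actual implementation; NumPy is used as a stand-in so it runs without a GPU, and the `chunk` parameter is exposed only so small inputs can exercise the chunking path:

```python
import numpy as np  # stand-in for cupy; the chunking logic is identical

def cumsum_chunked(a, chunk=2**31 - 1):
    """Cumulative sum computed in < 2**31-item chunks, carrying an
    offset between chunks, so the underlying scan never sees more
    items than a 32-bit signed offset can address."""
    out = np.empty_like(a)
    carry = a.dtype.type(0)
    for start in range(0, a.size, chunk):
        part = np.cumsum(a[start:start + chunk])
        part += carry                       # shift by the previous total
        out[start:start + chunk] = part
        carry = part[-1]                    # running total for next chunk
    return out
```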
Closes gh-9780. Note: if there is any uncertainty here, or this seems nontrivial to review, then as discussed elsewhere we can just close this and I'll do the one-line fix from the issue instead. (The reasoning is that mid- to long-term we'll want a more comprehensive fix anyway.)