Add int64 index support to cupyx.scipy.sparse #9825

Open
eriknw wants to merge 48 commits into cupy:main from eriknw:int64_sparse_indices
Conversation


@eriknw eriknw commented Mar 23, 2026

Closes #3513.

Sparse matrices (CSR, CSC, COO) now support int64 indices/indptr/row/col, matching scipy.sparse semantics. Index dtype is chosen automatically via get_index_dtype(check_contents=True) — int32 when values fit, int64 when they don't. All existing int32 behavior is preserved.
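The dtype-selection rule can be sketched in plain NumPy. This is a hypothetical stand-in for scipy's private `get_index_dtype(check_contents=True)` helper, not the actual implementation: int32 when every index value (and any known maximum) fits, int64 otherwise.

```python
import numpy as np

def pick_index_dtype(*arrays, maxval=None):
    """Sketch of scipy-style index-dtype selection (hypothetical helper):
    int32 when all contents and maxval fit, else int64."""
    i32 = np.iinfo(np.int32)
    if maxval is not None and maxval > i32.max:
        return np.int64
    for a in arrays:
        a = np.asarray(a)
        # check_contents: inspect actual values, not just the array dtype
        if a.size and (a.max() > i32.max or a.min() < i32.min):
            return np.int64
    return np.int32
```

So a matrix built with small indices stays int32 even if the user passed int64 arrays, while any value above INT32_MAX promotes the whole index set to int64.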

Matrices with dimensions or nnz exceeding ~2.1 billion (INT32_MAX) previously failed silently or raised unhelpful errors. This is a prerequisite for large-scale sparse workloads (e.g., single-cell genomics via rapids-singlecell, large graph analytics). There is already a companion PR that uses this branch: scverse/rapids-singlecell#620

Note that this PR was made with assistance of Claude Code (models: Sonnet 4.6, Opus 4.6). I have been heavily involved every step of the way, which I hope is clear from the level of hardening already present. Nevertheless, as is typical with AI-assisted work, I am part author and part reviewer, and I am still reviewing these changes.

How

  • cuSPARSE Generic API (spmv, spmm, spGEMM) already supports int64 via SpMatDescr — no changes needed for these paths
  • Pure-CuPy fallbacks for Legacy API functions that are int32-only (csr2cscEx2, xcoo2csr, xcsrsort, csrgeam2, etc.) using searchsorted, lexsort, and cumsum
  • CUDA kernel templating — all inline kernels that used int* or int32 parameters now use template parameter I
  • _from_parts() classmethods on CSR/CSC/COO for internal construction without check_contents downcast, replacing ad-hoc object.__new__ patterns
  • Shared helpers (_indptr_to_coo, _build_indptr) consolidate repeated patterns
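The core trick behind `_indptr_to_coo` (and several of the pure-CuPy fallbacks above) translates directly to NumPy. This is a sketch, not the PR's code: the row index of the k-th stored entry is the number of row boundaries at or below k, which one `searchsorted` over `indptr[1:]` computes for all entries at once.

```python
import numpy as np

def indptr_to_coo_rows(indptr, nnz):
    """Sketch of the _indptr_to_coo helper: expand a CSR indptr into
    per-entry row indices without cuSPARSE's int32-only xcsr2coo."""
    return np.searchsorted(indptr[1:], np.arange(nnz), side='right')

indptr = np.array([0, 2, 2, 5], dtype=np.int64)  # 3 rows, 5 stored entries
indptr_to_coo_rows(indptr, 5)  # rows: [0, 0, 2, 2, 2]
```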

Also fixed along the way: float32 precision bug in __setitem__ for nnz > 16M, several int32 overflow issues in binopt/setdiag/DIA kernels that affected large matrices regardless of index dtype, and random()/rand() O(m*n) memory usage for large shapes (noted by @samaid in #3513).

What remains

  • Request updates to the cuSPARSE Generic API so that fewer pure-CuPy fallback implementations would be needed
  • Run everything and experiment on a larger, newer GPU. Development so far has been on a Tesla V100-SXM2-32GB with CUDA 12.8, but since the scope includes large sparse arrays that need more than 32 GB, some paths require a beefier GPU
  • Verify the SpGEMM native int64 path on CUDA 13.0+
  • Activate the SpGEAM path when it ships in a public CUDA release. I would also like to test this locally
  • dense2csr bool-dtype fallback is still int32-only (guarded with ValueError; medium-effort fix)
  • cupy.repeat upstream fixes would simplify several workarounds (the workarounds are fast, so this is not urgent). I plan to create a PR that supports cupy.repeat with CuPy arrays as the repeats argument (nearly finished)

Future work is to add support for sparse arrays (not matrix classes). scipy.sparse semantics for sparse arrays are to simply use the dtype of the indices that the user provided (upcasting to int64 as appropriate when not everything is int32). Supporting mixed dtypes for indices/indptr/row/col in the sparse matrix classes is something we could consider, but I would prefer to do that in another PR. Also note that the sparse matrix classes are being deprecated in scipy and may be removed in 2027.

This is a large PR. It's also my first major contribution to CuPy. I can break this into multiple PRs if it would help. I value feedback.

Here is a review by Claude Code that explains some of the implementation details: line_by_line_full.md

eriknw added 30 commits March 16, 2026 15:25
Previously, all sparse matrix constructors hardcoded `astype('i')` (int32)
for index arrays, silently truncating any index value > INT32_MAX.

This change mirrors scipy's behavior: use int32 when all index values fit,
promote to int64 when they don't (via `get_index_dtype(check_contents=True)`).

Changes:
- `_coo.py`: shape-branch and final cast use `get_index_dtype`; scipy import
  preserves source dtype instead of forcing int32
- `_compressed.py`: all six construction branches compute `idx_dtype`
  (CuPy sparse, shape, scipy sparse, COO tuple, data/indices/indptr tuple,
  dense); final `astype('i')` replaced with `astype(idx_dtype)`
- `_construct.py`: `eye()` fast path uses `get_index_dtype(maxval=max(m,n))`
  for CSR, CSC, and COO formats

Safety guards (prevent silent corruption with int32-only cuSPARSE routines):
- `cusparse.csrilu02`: raise ValueError for int64 indices
- `cusparse.csrsm2`: raise ValueError for int64 indices
- `cusparse.csr2coo`: raise ValueError for int64 indptr (xcsr2coo reads
  indptr as int32, producing wrong row indices silently)
- `linalg.spsolve`: raise ValueError for int64 indices

Operations not yet supported with int64:
- `tocsr()` from COO (coosort/coo2csr are int32-only)
- `tocsc()` from CSR (csr2csc is int32-only)
- `tocoo()` from CSR (csr2coo guard raises until Phase 2)
- `sort_indices`, `sum_duplicates`, sparse add/multiply (later phases)
cuSPARSE's conversion/sort functions (xcoo2csr, xcsr2coo, csr2cscEx2,
xcsrsort, xcoosortBy*) only support int32 indices. Add pure-CuPy
fallbacks that activate when indices.dtype == int64:

- coo2csr / coo2csc: unique+scatter builds indptr without bincount
  (bincount would require two simultaneous ~17GB allocations on 32GB GPUs)
- csr2coo / csc2coo: searchsorted expands indptr → row/col array
  (cupy.repeat with ndarray repeats raises ValueError)
- csr2csc / csc2csr: new _cupy_csr2csc_int64 / _cupy_csc2csr_int64
  helpers using unique+scatter + searchsorted + lexsort
- csrsort / cscsort / coosort: lexsort-based fallback
- check_availability guards moved after int64 checks so pure-CuPy paths
  work on CUDA versions where legacy sort/coo functions are unavailable

Fix COO sum_duplicates ElementwiseKernels: hardcoded int32 src_row/
src_col/index silently truncated int64 values. Replace with generic
type I in all three kernel variants (bool, float, complex).
- csrgeam2: route int64 to pure-CuPy fallback (_cupy_csrgeam_int64)
  before the cuSPARSE availability check, matching the previous pattern
  applied to csrsort/coosort/csr2coo
- spgeam: add Generic API wrapper for cusparseSpGEAM via SoftLink +
  inline C forward declarations (absent from all public CUDA releases
  through 12.7.9; pure-CuPy fallback always active today)
- spgemm: fix result index dtype to use result_type(a, b) instead of
  hardcoded int32; note cuSPARSE SpGEMM rejects int64 at runtime
- idx_dtype: use numpy.result_type(a.indices.dtype, b.indices.dtype)
  throughout csrgeam2/spgeam/spgemm/fallback to handle mixed-dtype inputs
_index.py:
- _compress_getitem_kern/_compress_getitem_complex_kern: change `int32 minor`
  to `S minor` so scalar lookup (m[i, j]) works for column indices > INT32_MAX
- _csr_row_index_ker: change all `int32` to `I` so fancy row indexing
  (m[[r1,r2], :]) preserves int64 column indices
- _csr_row_index: cast `rows` to Ap.dtype for consistent kernel type param
- _csr_indptr_to_coo_rows: add int64 fallback via cupy.searchsorted since
  xcsr2coo (cuSPARSE) is int32-only

_compressed.py:
- _argmax_argmin_code: add TI template parameter for index/indptr types so
  argmax(axis=N) returns correct column coordinates for int64 matrices
- Expand name_expressions from 4 to 8 entries (add int64 index variants)
- _arg_minor_reduce: cast indptr slices to indices.dtype; use idx_dtype.type()
  for length parameter instead of always int64
… kernels

- eliminate_zeros: pure-CuPy fallback for int64 (searchsorted + unique/scatter)
  because cusparse.csr2csr_compress is int32-only
- _max_min_reduction_code: template<typename TI> + RawModule name_expressions,
  replacing RawKernel with hardcoded int* indptr parameters
- _GET_ROW_ID_ / _FIND_INDEX_HOLDING_COL_IN_ROW_ preambles: template<typename I>
  with _atomic_add_one helper (CUDA atomicAdd has no long long* overload)
- 4 kernel bodies (cupy_multiply_by_dense, cupy_multiply_by_csr_step{1,2},
  _cupy_csr2dense, _cupy_csr_diagonal): int locals → I; shape params int32 M/N → I
  (shape params must also be I: int32 params raise OverflowError for shape > INT32_MAX)
- spgemm: explicit ValueError for int64 (Generic API rejects int64 at runtime)
- Tests: TestInt64EliminateZeros, TestInt64Multiply, TestInt64Diagonal (11 tests)
- Move spgemm int64 guard before has_canonical_format assert and shape
  check (fires on any int64 CSR; previously required compatible shapes)
- TestInt64MinMaxReduction: 4 tests for _minor_reduce with int64 indptr
  (axis=1 max/min, axis=0 on CSC, int32 regression)
- TestInt64Toarray: 3 tests for _cupy_csr2dense int64 shape params
  (@testing.slow OOM-guarded large-shape tests + int32 regression)
- test_spgemm_rejects_int64: verifies clear ValueError for int64 inputs
Five independent `check_contents=True` downcast barriers in the
vstack/hstack/bmat/tocsr pipeline caused int64 index arrays to silently
revert to int32. Fix all five:

- `_compressed_sparse_stack`: bypass tuple-3 CSR/CSC constructor (use
  shape-only constructor + direct attribute assignment)
- `bmat`: use flat-index product heuristic (nrows*ncols-1) instead of
  max(shape) for idx_dtype, so large-product matrices use int64; bypass
  COO tuple-2 constructor for the same reason
- `coo_matrix._with_data`: bypass tuple-2 constructor to preserve index
  dtype through copy(), abs(), astype(), scalar multiply, etc.
- `_compressed._with_data`: same fix for CSR/CSC operations
- `coo2csr` / `coo2csc`: bypass tuple-3 CSR/CSC constructor in return
Add TestInt64DtypePreservation class (15 tests) to
test_sparse_int64_indices.py covering the check_contents=True
downcast barriers fixed: copy/abs/neg/scalar-mul/astype
on CSR and CSC, copy on COO with has_canonical_format propagation,
vstack, hstack (flat-index product heuristic), coo2csr/coo2csc
bypass, and int32 regressions. All tests use small index values
that fit in int32 — the scenario previously broken.
_minor_index_fancy() previously used a histogram kernel that allocated
O(N) memory (where N is the minor-axis size), causing OOM for int64
matrices where N > INT32_MAX. The kernels also silently truncated
index values > INT32_MAX via const int* parameters.

Add _minor_index_fancy_sorted(), a sort-based binary-search algorithm
that routes int64 matrices through argsort + searchsorted instead of
the histogram. Time: O((nnz + k) log(nnz + k)); space: O(nnz + k)
where k is the number of requested indices. Works for both CSR column
indexing and CSC row indexing. The int32 kernel path is unchanged.
Ten tests for _minor_index_fancy_sorted in TestInt64FancyMinorIndex:
large column/row values, multiple columns, duplicates, absent entries,
complex data, CSC symmetry, unsorted source, and an int32 regression
check (int32 kernel path must remain unchanged).
cuSPARSE spGEMM (Generic API) returns CUSPARSE_STATUS_NOT_SUPPORTED for
any int64 index descriptor on all public releases through 12.7.9, despite
advertising int64 support via SpMatDescr.

Add _cupy_spgemm_int64 to cusparse.py: a sort-merge SpGEMM that generates
all partial products (i, j, a_ik*b_kj) as a flat COO array, then merges
via sum_duplicates. Uses cumsum+searchsorted to expand indptr and repeat
counts (cupy.repeat rejects CuPy-ndarray repeat-counts). Both input index
arrays are normalised to int64 at entry; P-sized temporaries are freed
explicitly before sum_duplicates allocates its own scratch space.

Sort-merge is O(P log P) time and O(P+nnz_C) space, where P is total
partial products. Gustavson's row-wise algorithm (O(N) dense workspace
per row) is infeasible for int64 where N can exceed 2^31.

Dispatch in spgemm routes int64 inputs before check_availability so the
fallback works on all CUDA versions. A forward-compatible path is wired
via check_availability('spgemm_int64') (threshold 99000, no public release
has it yet): when native int64 spgemm ships, updating that version number
activates the cuSPARSE path automatically. Any int32 input is upcasted to
int64 via _with_indices_dtype (shares data array) so cuSPARSE receives
uniform int64 as required. The int32 cuSPARSE path is unchanged.

Also add _with_indices_dtype helper (bypass-constructor pattern, shares
data array) for promoting index dtype without copying data.

Tests: rewrite TestInt64SpGEMM in test_sparse_int64_indices.py to verify
correctness (large col values, duplicate product merging, empty products,
int32 regression).
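The sort-merge idea can be sketched end-to-end in NumPy (a simplified stand-in for `_cupy_spgemm_int64`, without the memory-management details): expand every partial product (i, j, a_ik*b_kj) into flat COO form, then sort by (i, j) and reduce duplicates.

```python
import numpy as np

def spgemm_sort_merge(a_indptr, a_indices, a_data,
                      b_indptr, b_indices, b_data):
    """Sketch of sort-merge SpGEMM: emit all partial products as flat
    COO, then merge duplicate (i, j) pairs. O(P log P) time, O(P) space."""
    # Row of each stored A entry, via searchsorted indptr expansion.
    a_rows = np.searchsorted(a_indptr[1:], np.arange(len(a_data)),
                             side='right')
    # Each A entry a_ik pairs with every entry of B's row k.
    counts = b_indptr[a_indices + 1] - b_indptr[a_indices]
    src = np.repeat(np.arange(len(a_data)), counts)   # which A entry
    offs = np.arange(len(src)) - np.repeat(np.cumsum(counts) - counts,
                                           counts)    # offset in B's row
    b_pos = b_indptr[a_indices[src]] + offs
    i = a_rows[src]
    j = b_indices[b_pos]
    v = a_data[src] * b_data[b_pos]
    # Merge duplicates: sort by (i, j), then segmented sum.
    key = i * (j.max() + 1 if len(j) else 1) + j
    order = np.argsort(key, kind='stable')
    key, i, j, v = key[order], i[order], j[order], v[order]
    _, start = np.unique(key, return_index=True)
    sums = np.add.reduceat(v, start) if len(v) else v
    return i[start], j[start], sums
```

For A = [[1,0],[2,3]] and B = [[0,4],[5,6]] this produces COO coordinates (0,1), (1,0), (1,1) with values 4, 15, 26, matching the dense product.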
…, multi-row)

Adds four tests to TestInt64SpGEMM covering paths not exercised by the
original four: alpha coefficient scaling, mixed int32/int64 input index
dtypes, float32 data preservation, and multi-row output (exercises the
cumsum+searchsorted row-expansion for rows beyond row 0).
Fix 9 sites in cusparse.py and _coo.py where check_contents=True in
tuple-2/3 sparse constructors silently downcasted explicitly-int64
index arrays (with values ≤ INT32_MAX) to int32.  Three bypass
patterns:

  A. COO shape-only constructor (csr2coo, csc2coo, _cupy_csrgeam_int64)
  B. nnz=0 conditional re-cast (_coo.tocsr, _coo.tocsc early returns)
  C. object.__new__ + direct attribute assignment (_cupy_csr2csc_int64,
     _cupy_csc2csr_int64, both nnz=0 and nnz>0 paths)

Pattern C is required for functions that handle large shapes: the
shape-only CSR/CSC constructor allocates an (n+1)-element indptr that
OOMs at ~17 GB for shapes near INT32_MAX.

Add TestInt64ConversionPreservation (11 tests) to the mainline int64
test file covering all fixed sites.
Add four tests to TestInt64ConversionPreservation covering important
correctness properties of the object.__new__ bypass from the prior commit:

- has_canonical_format / has_sorted_indices compute correctly via lazy
  kernel on object.__new__ results (no _has_canonical_format cached)
- copy() on an object.__new__ CSC preserves int64 index dtype
- tocoo() on _cupy_csr2csc_int64 output preserves int64 row/col dtype
- empty COO → tocsc (Pattern B) → tocsr (Pattern C) chain preserves int64
  end to end

Also adds test_csc_tocsr_empty_small_shape_preserves_int64 (the symmetric
counterpart to the CSR→CSC empty test, covering _cupy_csc2csr_int64 nnz=0).
…dptr building

CUDA provides atomicAdd(unsigned long long*,...) but not the signed long long
overload, so cupy.add.at raises TypeError for int64 destination arrays.
Reinterpreting int64 indptr[1:] as uint64 sidesteps this: two's-complement
addition is bit-identical for non-negative values, and indptr counts are
bounded by nnz which is far below 2^63 on any real system.

Replaces unique+scatter at 6 sites (coo2csr, coo2csc, _cupy_csr2csc_int64,
_cupy_csc2csr_int64, eliminate_zeros, _minor_index_fancy_sorted).  The new
approach is O(nnz) with GPU atomics vs O(nnz log nnz) for unique; measured
7-25x faster in benchmarks.
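A NumPy sketch of the `_build_indptr` pattern (NumPy's `add.at` accepts int64 directly, but the uint64 view is shown to mirror the CUDA atomicAdd workaround; the bit pattern is identical for non-negative counts):

```python
import numpy as np

def build_indptr(rows, n_rows):
    """Sketch of _build_indptr: count entries per row via add.at on a
    uint64 view (two's-complement addition is bit-identical for
    non-negative values), then cumsum counts into an indptr."""
    indptr = np.zeros(n_rows + 1, dtype=np.int64)
    np.add.at(indptr.view(np.uint64)[1:], rows, 1)
    return np.cumsum(indptr)

build_indptr(np.array([0, 0, 2]), 3)  # -> [0, 2, 2, 3]
```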
…sr/csc conversions

- spgemm_int64 threshold: (99000, None) → (13000, None).  CUDA 13.0 added
  native int64 SpGEMM (CUSPARSE-2365); cuSPARSE version encoding is
  major*1000 + minor*100 + patch (confirmed: 12.5.10 → 12510), so CUDA 13.0
  → 13000.  Mixed int32/int64 inputs are upcast to uniform int64 before the
  cuSPARSE call; pure-CuPy fallback only applies on CUDA < 13.0.

- Set has_sorted_indices = True on all four object.__new__ bypass paths in
  _cupy_csr2csc_int64 and _cupy_csc2csr_int64 (nnz=0 and main paths).
  The lexsort-based algorithm provably produces sorted indices; marking this
  avoids a redundant kernel scan on every .has_sorted_indices access.

- Fix spgeam() docstring: remove incorrect "CUDA >= 12.0" claim; SpGEAM
  is absent from all public releases through CUDA 13.2.

- Add rule in spgemm() comment: always upcast mixed int32/int64 inputs to
  uniform int64 and use cuSPARSE when the native API is available; never
  fall back to pure-CuPy merely because inputs have mixed index types.
- Add .real, .imag properties (_data.py) and trace() method (_base.py)
  to all sparse matrices, matching scipy.sparse

- Trim verbose "TEMPORARY WORKAROUND" comments throughout cusparse.py
  to concise one-liners; CuPy devs don't need the preamble

- Fix pre-existing "entories" typos in eliminate_zeros (_csr.py) and
  count_nonzero (_data.py) docstrings

- Remove unnecessary int(mask.sum()) D2H sync in eliminate_zeros;
  use len(new_data) (host metadata) instead

- Propagate has_sorted_indices in _with_indices_dtype (int32→int64
  cast preserves sort order)

- Fix E501 line-length violations in all changed files

- Add int64 index dtype documentation to scipy_sparse.rst

- Fix test line-length violations in test_sparse_int64_indices.py
…r2csc/csc2csr

- Add _indptr_to_coo(): replaces 12 inline searchsorted(indptr[1:],
  arange(nnz), side='right') call sites across 4 files

- Add _build_indptr(): replaces 6 inline add.at(view(uint64)) + cumsum
  call sites across 3 files; handles both int32 and int64

- Merge _cupy_csr2csc_int64 and _cupy_csc2csr_int64 into a single
  _cupy_transpose_compressed_int64(x, output_cls, out_dim)
…sites

Add _compressed_sparse_matrix._from_parts() and coo_matrix._from_parts()
for constructing from pre-validated arrays without check_contents
downcast.  Replaces 16 ad-hoc bypass patterns (object.__new__ +
manual attribute assignment, shape-only constructor + overwrite) with
a single clean internal API.

Also converts _with_data() and _with_indices_dtype() to use
_from_parts, eliminating every remaining bypass site.
…e int64 guards

- Replace 6 hardcoded name_expression lists (up to 8 entries each) with
  _idx_types and _arg_reduction_types class attributes + comprehensions

- Add _check_int32_indices() helper; replace 3 near-identical ValueError
  blocks in csrsm2, csrilu02, and spsolve
Add consistent TODO(cuSPARSE), TODO(cupy), TODO(hipSPARSE) tags to all
int64 fallback sites, replacing verbose TEMPORARY WORKAROUND text and
bare "int32-only" comments.

grep -r 'TODO(cuSPARSE)' finds all 22 cuSPARSE int32-only workarounds.
grep -r 'TODO(cupy)' finds all 4 CuPy core limitation workarounds.
- Fix coosort int64 path: _has_canonical_format = False silently set a
  nonexistent private attribute on COO (which uses has_canonical_format
  as a plain attribute, not a property); the flag remained True

- Use public has_canonical_format setter in spgeam (new code only)

- Document lazy-computed attributes in _from_parts docstrings
_set_many and _zero_many use arange(nnz) as position markers in a
temporary CSR to look up offsets.  Two pre-existing bugs:

1. float32 arange: only 24-bit mantissa, so positions collide for
   nnz > 16M.  Writes land on wrong neighbor.  Fix: use float64.

2. int32 offset cast: positions truncate for nnz > INT32_MAX.
   Valid positions wrap negative and get masked as not-found,
   causing silent fallthrough to duplicate insertion.
   Fix: use self.indices.dtype (int32 when nnz fits, int64 when not).
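The float32 collision is easy to demonstrate: a float32 significand has 24 bits, so consecutive integers above 2**24 map to the same value.

```python
import numpy as np

# float32 cannot distinguish consecutive integers above 2**24 (~16.8M),
# so float32 position markers silently alias for nnz > 16M.
assert np.float32(2**24) == np.float32(2**24 + 1)   # collision
assert np.float64(2**24) != np.float64(2**24 + 1)   # float64 fixes it
```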
…und)

_insert_many uses cupy.add.at to count per-row insertions, but
add.at rejects int64 targets (CUDA lacks signed int64 atomicAdd).
Apply the same view(uint64) workaround used by _build_indptr.
- setdiag: use self.indices.dtype instead of hardcoded 'i' (int32)
  for auxiliary CSR construction.  col_st > INT32_MAX overflowed.

- diagonal: change kernel param from int32 k to I k so diagonal
  offset k > INT32_MAX can be passed without overflow.  Discovered
  because setdiag calls self.diagonal(k=k) internally.
eriknw added 7 commits March 21, 2026 10:35
…l matrices

The int32 fallback kernels (used for bool dtype) silently overflow
when m*n > INT32_MAX.  Add a ValueError guard with a suggestion to
convert to float (which routes through denseToSparse Generic API).
Full int64 kernel support deferred to a separate PR.
Use get_index_dtype(maxval=max(shape)) for offsets in the DIA
constructor and for indptr/indices in tocsc.  Template the tocsc
kernel with I (int32 or int64) instead of hardcoded int32.
Early return when sort_by='r' and has_canonical_format is True.
Saves ~5ms per tocsr() call on canonical 1M-element COO (~47%
speedup).  Benefits both int32 and int64 paths.

csrsort/cscsort were investigated but not added: has_sorted_indices
triggers an expensive GPU kernel scan (43ms for small matrices),
more costly than the sort itself.
Template int32 params to I in the dia_nnz ReductionKernel.
The DIA constructor now produces int64 offsets for large shapes,
but getnnz still expected int32.
random_state.choice(m*n, k) internally builds an O(m*n) permutation
array, making sparse.random() unusable for large shapes (e.g. 10^6 x
10^6 needs 8 TB). Add rejection-sampling fallback that uses O(k)
memory: generate random flat indices via rand()*mn, deduplicate with
unique, repeat for shortfall. Converges in 1-2 iterations for typical
sparse densities.

The original choice-based path is kept for moderate m*n (faster single
kernel), with an OOM try/except that falls back to rejection sampling.
Index dtype is selected based on whether m*n exceeds INT32_MAX.

Addresses the concern raised by @samaid in cupy#3513 about the Knuth
permutation algorithm being unsuitable for large sparse matrices.
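The rejection-sampling loop can be sketched in NumPy (an illustrative stand-in, not the PR's code): draw k flat indices with replacement, deduplicate, and redraw only the shortfall.

```python
import numpy as np

def sample_flat_indices(mn, k, rng):
    """Sketch of the O(k)-memory fallback: draw flat indices in [0, mn)
    with replacement, dedupe with unique, and repeat for the shortfall.
    Converges in 1-2 rounds at typical sparse densities (k << mn)."""
    out = np.empty(0, dtype=np.int64)
    while len(out) < k:
        draw = (rng.random(k - len(out)) * mn).astype(np.int64)
        out = np.unique(np.concatenate([out, draw]))
    return out

idx = sample_flat_indices(10**12, 1000, np.random.default_rng(0))
```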
Rewrite 3 tests that allocated ~17 GB GPU memory each, preventing
parallel test execution (pytest -n). The int64 kernel code paths
are still exercised:

- test_sum_axis0_large_cols: use forced-int64 small matrix instead
  of _LARGE-shaped matrix whose dense result is 17 GB
- test_dia_tocsr/tocsc_large_shape: swap shape from (2, _LARGE+1)
  to (_LARGE+1, 2) so the CSC indptr has 3 entries instead of 17 GB;
  max(shape) > INT32_MAX still triggers the int64 kernel path
@eriknw eriknw requested a review from a team as a code owner March 23, 2026 22:33
@leofang leofang self-assigned this Mar 24, 2026
@leofang leofang added this to the v15.0.0a1 milestone Mar 24, 2026
@leofang leofang added cat:feature New features/APIs to-be-backported Pull-requests to be backported to stable branch prio:high labels Mar 24, 2026
@eriknw
Author

eriknw commented Mar 24, 2026

This work began with samaid@3014b99 from @samaid, which is the "spiritual" predecessor to this PR and helped inform things.

Add has_canonical_format / has_sorted_indices kwargs to _from_parts on
CSR/CSC/COO so sort and canonical flags can be set at construction time
instead of as post-hoc mutations.

This eliminates unnecessary device synchronizations in three paths:

- _with_data (used by copy, abs, neg, astype, real, imag): previously
  discarded cached flags, forcing a GPU kernel + D2H sync to recompute
  them on the next operation that checks canonical/sorted state.

- _set_many / _zero_many (used by __setitem__): previously constructed
  a throwaway CSR via the public constructor, which runs
  get_index_dtype(check_contents=True) — two D2H syncs to validate
  indices that are already known-valid.

- COO _from_parts: now defaults has_canonical_format=False, preventing
  AttributeError when copy/abs/neg is called on a _from_parts-created
  COO without explicit flag setting.
@seberg
Member

seberg commented Mar 25, 2026

/test mini

eriknw added 4 commits March 25, 2026 12:15
CSR/CSC/COO transpose() and COO dot(scalar) used the public constructor,
which runs check_contents=True and downcasts small int64 values to
int32. Switch to _from_parts to preserve the caller's chosen dtype.

Also fix _min_or_max_axis to use _from_parts with explicit idx_dtype
(avoids float64 zeros as COO indices and preserves the input dtype on
the result), include b.indices in _compressed_sparse_stack's
get_index_dtype call, and add a short-circuit to _with_indices_dtype
when the dtype already matches.
…gemm

Replace public constructor calls with _from_parts in _major_slice,
_minor_slice, _major_index_fancy, binopt_csr, COO reshape, and the
pure-CuPy SpGEMM fallback. The public constructor runs
check_contents=True which downcasts small int64 values to int32,
silently losing the caller's chosen index dtype.

Extract _empty_like() helper on _compressed_sparse_matrix for the
recurring empty-result pattern (3 call sites). Handle the copy
parameter explicitly in _major_slice since _from_parts does not copy
(matters for getrow which passes copy=True).

Adds 7 regression tests to the mainline test file and 26 to
sparse_tests, all verified to fail without these changes.
Add a fast path for scalar comparisons (!=, >, <, ==, etc.) that
avoids the O(m*n) expansion in binopt_csr when op(0, scalar) is False.
In that case only stored entries can produce True, so the result is
computed in O(nnz) by applying the op to self.data and filtering.
This fixes OOM for large-shape matrices (e.g., m != 0 on a
2x(2^31+1) CSR previously tried to allocate 34 GB).

Also fix multiply_by_scalar, multiply_by_dense, and multiply_by_csr
to use _from_parts instead of the public constructor, preserving int64
index dtype and avoiding check_contents D2H syncs. This fixes the
asymmetry where m * 2.0 preserved int64 but m / 2.0 did not.

Fix _minor_index_fancy_sorted to propagate the source's index dtype
to the output instead of recomputing from the output shape.

Add dtype mismatch validation to _from_parts (rejects mismatched
indices/indptr dtypes that would cause downstream cuSPARSE failures).
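The scalar-comparison fast path can be sketched in NumPy for the `!=` case (hypothetical function, simplified from the binopt_csr dispatch): when op(0, scalar) is False, implicit zeros can never produce True, so only the stored entries need testing.

```python
import numpy as np

def csr_ne_scalar(indptr, indices, data, scalar):
    """Sketch of the O(nnz) fast path for m != scalar: implicit zeros
    never contribute when op(0, scalar) is False, so no O(m*n)
    densification is needed. Returns COO coords of True entries."""
    op = np.not_equal
    assert not op(data.dtype.type(0), scalar)  # fast-path precondition
    mask = op(data, scalar)
    rows = np.searchsorted(indptr[1:], np.arange(len(data)), side='right')
    return rows[mask], indices[mask]
```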
The previous commit inadvertently reverted the _empty_like helper,
_major_slice, _minor_slice, and _major_index_fancy _from_parts fixes.
Restore them and also fix two empty-case gaps in _minor_index_fancy
and _minor_index_fancy_sorted that were missed previously.

Add _from_parts dtype mismatch validation (rejects mismatched
indices/indptr dtypes) and propagate source index dtype through
_minor_index_fancy_sorted output.

Adds regression tests for all changes.
@leofang
Member

leofang commented Mar 27, 2026

/test mini

eriknw added 3 commits March 30, 2026 18:57
Enable native cuSPARSE SpGEMM for int64 on CUDA 13.0+. The
cuSPARSE version (12709) is the same for CUDA 12.7 and 13.0, so
dispatch checks the CUDA runtime version instead. Result matrices
use _from_parts to preserve int64 index dtype.

Template dense2csr kernels for int64 (B3): _d2c_atomic_add helper
for int64 atomicAdd, remove ValueError guard for large shapes.

Add _cumsum_int64 workaround for cupy.cumsum silent failure on
arrays >= 2^32 elements (CUB scan wrapper int overflow; fix in
PR cupy#9810). Used in _build_indptr and dense2csr.

Fix COO transpose falsely propagating has_canonical_format (caused
54 CI failures in LSMR/SVDs). Fix bmat crash on mixed sparse+dense
blocks. Fix _maximum_minimum, setdiag, _major_slice losing int64
dtype or flags. Move int64 routing before check_availability in
CSR/CSC conversion functions.

Convert 37 bare asserts to ValueError/TypeError in cusparse.py.
Add early-switch heuristic in random() based on allocation size.
Add 17 regression tests.
_get_csr_submatrix_major_axis always returns a new indptr array
(via subtraction), so the explicit copy in _major_slice was redundant.

Replace em-dashes and superscript characters with ASCII equivalents
in comments for consistency with the rest of the codebase.
@eriknw
Author

eriknw commented Mar 30, 2026

/test mini

Update test_cusparse.py to expect ValueError instead of AssertionError
for non-F-contiguous spmm inputs (assert was converted to ValueError
in commit 91dd3f8).

Add GPU memory checks to dense2csr int64 tests that allocate ~19 GB
(2 GB dense bool + 17 GB info array). These OOM under memory pressure
when run after other tests in the full sparse suite.
@seberg
Copy link
Copy Markdown
Member

seberg commented Mar 31, 2026

/test mini


Labels

cat:feature New features/APIs prio:high to-be-backported Pull-requests to be backported to stable branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support sparse matrices with more than 2^31-1 elements

3 participants