Add int64 index support to cupyx.scipy.sparse #9825
Open
Previously, all sparse matrix constructors hardcoded `astype('i')` (int32)
for index arrays, silently truncating any index value > INT32_MAX.
This change mirrors scipy's behavior: use int32 when all index values fit,
promote to int64 when they don't (via `get_index_dtype(check_contents=True)`).
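The selection rule can be sketched in plain NumPy terms. This is a hypothetical stand-in for scipy's private `get_index_dtype` (the helper name `pick_index_dtype` is illustrative, not the PR's code):

```python
import numpy as np

INT32_MAX = np.iinfo(np.int32).max

def pick_index_dtype(arrays=(), maxval=None, check_contents=False):
    """Return int32 when every candidate index value fits, else int64."""
    if maxval is not None and maxval > INT32_MAX:
        return np.int64
    if check_contents:
        for a in arrays:
            a = np.asarray(a)
            # Promote only when actual contents exceed the int32 range.
            if a.size and (int(a.max()) > INT32_MAX or int(a.min()) < -INT32_MAX - 1):
                return np.int64
    return np.int32

small = np.array([0, 5, 7], dtype=np.int64)       # fits: stays int32
big = np.array([0, 2 ** 31], dtype=np.int64)      # exceeds INT32_MAX: int64
```

With `check_contents=True`, an int64 input array whose values all fit is still mapped to int32, which is exactly the downcast behavior several later commits have to bypass for internal construction.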
Changes:
- `_coo.py`: shape-branch and final cast use `get_index_dtype`; scipy import
preserves source dtype instead of forcing int32
- `_compressed.py`: all six construction branches compute `idx_dtype`
(CuPy sparse, shape, scipy sparse, COO tuple, data/indices/indptr tuple,
dense); final `astype('i')` replaced with `astype(idx_dtype)`
- `_construct.py`: `eye()` fast path uses `get_index_dtype(maxval=max(m,n))`
for CSR, CSC, and COO formats
Safety guards (prevent silent corruption with int32-only cuSPARSE routines):
- `cusparse.csrilu02`: raise ValueError for int64 indices
- `cusparse.csrsm2`: raise ValueError for int64 indices
- `cusparse.csr2coo`: raise ValueError for int64 indptr (xcsr2coo reads
indptr as int32, producing wrong row indices silently)
- `linalg.spsolve`: raise ValueError for int64 indices
Operations not yet supported with int64:
- `tocsr()` from COO (coosort/coo2csr are int32-only)
- `tocsc()` from CSR (csr2csc is int32-only)
- `tocoo()` from CSR (csr2coo guard raises until Phase 2)
- `sort_indices`, `sum_duplicates`, sparse add/multiply (later phases)
cuSPARSE's conversion/sort functions (xcoo2csr, xcsr2coo, csr2cscEx2, xcsrsort, xcoosortBy*) only support int32 indices. Add pure-CuPy fallbacks that activate when indices.dtype == int64:
- coo2csr / coo2csc: unique+scatter builds indptr without bincount (bincount would require two simultaneous ~17 GB allocations on 32 GB GPUs)
- csr2coo / csc2coo: searchsorted expands indptr → row/col array (cupy.repeat with ndarray repeats raises ValueError)
- csr2csc / csc2csr: new _cupy_csr2csc_int64 / _cupy_csc2csr_int64 helpers using unique+scatter + searchsorted + lexsort
- csrsort / cscsort / coosort: lexsort-based fallback
- check_availability guards moved after the int64 checks so the pure-CuPy paths work on CUDA versions where legacy sort/coo functions are unavailable

Fix COO sum_duplicates ElementwiseKernels: hardcoded int32 src_row/src_col/index silently truncated int64 values. Replace with a generic type I in all three kernel variants (bool, float, complex).
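The searchsorted-based indptr expansion behind the csr2coo / csc2coo fallbacks can be illustrated with plain NumPy (a stand-in sketch, not the CuPy code itself):

```python
import numpy as np

# Expand a CSR indptr into per-nonzero row indices without cuSPARSE's
# int32-only xcsr2coo. The row of the k-th stored element equals the
# number of indptr[1:] entries that are <= k.
indptr = np.array([0, 2, 2, 5], dtype=np.int64)   # 3 rows, 5 nonzeros
nnz = int(indptr[-1])
rows = np.searchsorted(indptr[1:], np.arange(nnz), side='right')
# rows is [0, 0, 2, 2, 2]: row 0 holds elements 0-1, row 1 is empty,
# row 2 holds elements 2-4.
```

This avoids `cupy.repeat(arange(n_rows), diff(indptr))`, which (as noted above) raises ValueError when the repeat counts are a CuPy ndarray.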
- csrgeam2: route int64 to a pure-CuPy fallback (_cupy_csrgeam_int64) before the cuSPARSE availability check, matching the pattern previously applied to csrsort/coosort/csr2coo
- spgeam: add a Generic API wrapper for cusparseSpGEAM via SoftLink + inline C forward declarations (absent from all public CUDA releases through 12.7.9; the pure-CuPy fallback is always active today)
- spgemm: fix the result index dtype to use result_type(a, b) instead of hardcoded int32; note that cuSPARSE SpGEMM rejects int64 at runtime
- idx_dtype: use numpy.result_type(a.indices.dtype, b.indices.dtype) throughout csrgeam2/spgeam/spgemm/fallback to handle mixed-dtype inputs
_index.py:
- _compress_getitem_kern / _compress_getitem_complex_kern: change `int32 minor` to `S minor` so scalar lookup (m[i, j]) works for column indices > INT32_MAX
- _csr_row_index_ker: change all `int32` to `I` so fancy row indexing (m[[r1, r2], :]) preserves int64 column indices
- _csr_row_index: cast `rows` to Ap.dtype for a consistent kernel type param
- _csr_indptr_to_coo_rows: add an int64 fallback via cupy.searchsorted since xcsr2coo (cuSPARSE) is int32-only

_compressed.py:
- _argmax_argmin_code: add a TI template parameter for index/indptr types so argmax(axis=N) returns correct column coordinates for int64 matrices
- Expand name_expressions from 4 to 8 entries (add int64 index variants)
- _arg_minor_reduce: cast indptr slices to indices.dtype; use idx_dtype.type() for the length parameter instead of always int64
… kernels
- eliminate_zeros: pure-CuPy fallback for int64 (searchsorted + unique/scatter)
because cusparse.csr2csr_compress is int32-only
- _max_min_reduction_code: template<typename TI> + RawModule name_expressions,
replacing RawKernel with hardcoded int* indptr parameters
- _GET_ROW_ID_ / _FIND_INDEX_HOLDING_COL_IN_ROW_ preambles: template<typename I>
with _atomic_add_one helper (CUDA atomicAdd has no long long* overload)
- 4 kernel bodies (cupy_multiply_by_dense, cupy_multiply_by_csr_step{1,2},
_cupy_csr2dense, _cupy_csr_diagonal): int locals → I; shape params int32 M/N → I
(shape params must also be I: int32 params raise OverflowError for shape > INT32_MAX)
- spgemm: explicit ValueError for int64 (Generic API rejects int64 at runtime)
- Tests: TestInt64EliminateZeros, TestInt64Multiply, TestInt64Diagonal (11 tests)
- Move the spgemm int64 guard before the has_canonical_format assert and shape check (fires on any int64 CSR; previously required compatible shapes)
- TestInt64MinMaxReduction: 4 tests for _minor_reduce with int64 indptr (axis=1 max/min, axis=0 on CSC, int32 regression)
- TestInt64Toarray: 3 tests for _cupy_csr2dense int64 shape params (@testing.slow OOM-guarded large-shape tests + an int32 regression)
- test_spgemm_rejects_int64: verifies a clear ValueError for int64 inputs
Five independent `check_contents=True` downcast barriers in the vstack/hstack/bmat/tocsr pipeline caused int64 index arrays to silently revert to int32. Fix all five:
- `_compressed_sparse_stack`: bypass the tuple-3 CSR/CSC constructor (use the shape-only constructor + direct attribute assignment)
- `bmat`: use a flat-index product heuristic (nrows*ncols - 1) instead of max(shape) for idx_dtype, so large-product matrices use int64; bypass the COO tuple-2 constructor for the same reason
- `coo_matrix._with_data`: bypass the tuple-2 constructor to preserve index dtype through copy(), abs(), astype(), scalar multiply, etc.
- `_compressed._with_data`: same fix for CSR/CSC operations
- `coo2csr` / `coo2csc`: bypass the tuple-3 CSR/CSC constructor in the return
Add a TestInt64DtypePreservation class (15 tests) to test_sparse_int64_indices.py covering the check_contents=True downcast barriers just fixed: copy/abs/neg/scalar-mul/astype on CSR and CSC, copy on COO with has_canonical_format propagation, vstack, hstack (flat-index product heuristic), coo2csr/coo2csc bypass, and int32 regressions. All tests use small index values that fit in int32, the scenario that was previously broken.
_minor_index_fancy() previously used a histogram kernel that allocated O(N) memory (where N is the minor-axis size), causing OOM for int64 matrices where N > INT32_MAX. The kernels also silently truncated index values > INT32_MAX via const int* parameters. Add _minor_index_fancy_sorted(), a sort-based binary-search algorithm that routes int64 matrices through argsort + searchsorted instead of the histogram. Time: O((nnz + k) log(nnz + k)); space: O(nnz + k) where k is the number of requested indices. Works for both CSR column indexing and CSC row indexing. The int32 kernel path is unchanged.
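A NumPy sketch of the sort + binary-search idea (illustrative only; the real `_minor_index_fancy_sorted` also reassembles a CSR/CSC result from the matched positions):

```python
import numpy as np

# Locate the stored entries whose minor-axis index matches each requested
# column in O(nnz log nnz) time and O(nnz) space, instead of building an
# O(N) histogram over the minor axis (which OOMs when N > INT32_MAX).
indices = np.array([7, 3, 9, 3, 2 ** 40], dtype=np.int64)  # stored column indices
wanted = np.array([3, 2 ** 40], dtype=np.int64)            # requested columns

order = np.argsort(indices, kind='stable')
sorted_cols = indices[order]
lo = np.searchsorted(sorted_cols, wanted, side='left')
hi = np.searchsorted(sorted_cols, wanted, side='right')
# Storage positions of the entries matching each wanted column.
hits = [order[l:h] for l, h in zip(lo, hi)]
```

The binary searches make the cost depend on nnz and the number of requested indices, never on the minor-axis extent N.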
Ten tests for _minor_index_fancy_sorted in TestInt64FancyMinorIndex: large column/row values, multiple columns, duplicates, absent entries, complex data, CSC symmetry, unsorted source, and an int32 regression check (int32 kernel path must remain unchanged).
cuSPARSE spGEMM (Generic API) returns CUSPARSE_STATUS_NOT_SUPPORTED for
any int64 index descriptor on all public releases through 12.7.9, despite
advertising int64 support via SpMatDescr.
Add _cupy_spgemm_int64 to cusparse.py: a sort-merge SpGEMM that generates
all partial products (i, j, a_ik*b_kj) as a flat COO array, then merges
via sum_duplicates. Uses cumsum+searchsorted to expand indptr and repeat
counts (cupy.repeat rejects CuPy-ndarray repeat-counts). Both input index
arrays are normalised to int64 at entry; P-sized temporaries are freed
explicitly before sum_duplicates allocates its own scratch space.
Sort-merge is O(P log P) time and O(P+nnz_C) space, where P is total
partial products. Gustavson's row-wise algorithm (O(N) dense workspace
per row) is infeasible for int64 where N can exceed 2^31.
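The partial-product enumeration can be sketched with NumPy/SciPy as a simplified stand-in for `_cupy_spgemm_int64` (the helper name and structure here are illustrative; alpha scaling and the explicit temporary frees are omitted):

```python
import numpy as np
from scipy import sparse

def spgemm_sort_merge(a, b):
    """Sketch: enumerate all partial products as flat COO, then merge."""
    a = a.tocsr()
    b = b.tocsr()
    # Row index of each stored element of A, via indptr expansion.
    a_rows = np.searchsorted(a.indptr[1:], np.arange(a.nnz), side='right')
    # Each stored A element (i, k, v) pairs with every stored element of B's row k.
    counts = np.diff(b.indptr)[a.indices]        # nnz of B row k, per A entry
    rep = np.repeat(np.arange(a.nnz), counts)    # which A entry each product uses
    # Offset of each partial product within its B row.
    group_start = np.cumsum(counts) - counts     # exclusive prefix sum
    offs = np.arange(int(counts.sum())) - np.repeat(group_start, counts)
    pos = np.repeat(b.indptr[a.indices], counts) + offs
    i = a_rows[rep]
    j = b.indices[pos]
    v = a.data[rep] * b.data[pos]
    c = sparse.coo_matrix((v, (i, j)), shape=(a.shape[0], b.shape[1]))
    c.sum_duplicates()                           # merge duplicate (i, j) products
    return c.tocsr()

a = sparse.random(6, 5, density=0.4, random_state=0, format='csr')
b = sparse.random(5, 4, density=0.4, random_state=1, format='csr')
dense_got = spgemm_sort_merge(a, b).toarray()
dense_ref = (a @ b).toarray()
```

P here is `counts.sum()`, the total number of partial products, which drives both the O(P log P) time and O(P + nnz_C) space bounds quoted above.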
Dispatch in spgemm routes int64 inputs before check_availability so the
fallback works on all CUDA versions. A forward-compatible path is wired
via check_availability('spgemm_int64') (threshold 99000, no public release
has it yet): when native int64 spgemm ships, updating that version number
activates the cuSPARSE path automatically. Any int32 input is upcasted to
int64 via _with_indices_dtype (shares data array) so cuSPARSE receives
uniform int64 as required. The int32 cuSPARSE path is unchanged.
Also add _with_indices_dtype helper (bypass-constructor pattern, shares
data array) for promoting index dtype without copying data.
Tests: rewrite TestInt64SpGEMM in test_sparse_int64_indices.py to verify
correctness (large col values, duplicate product merging, empty products,
int32 regression).
…, multi-row)

Adds four tests to TestInt64SpGEMM covering paths not exercised by the original four: alpha coefficient scaling, mixed int32/int64 input index dtypes, float32 data preservation, and multi-row output (exercises the cumsum+searchsorted row expansion for rows beyond row 0).
Fix 9 sites in cusparse.py and _coo.py where check_contents=True in
tuple-2/3 sparse constructors silently downcasted explicitly-int64
index arrays (with values ≤ INT32_MAX) to int32. Three bypass
patterns:
A. COO shape-only constructor (csr2coo, csc2coo, _cupy_csrgeam_int64)
B. nnz=0 conditional re-cast (_coo.tocsr, _coo.tocsc early returns)
C. object.__new__ + direct attribute assignment (_cupy_csr2csc_int64,
_cupy_csc2csr_int64, both nnz=0 and nnz>0 paths)
Pattern C is required for functions that handle large shapes: the
shape-only CSR/CSC constructor allocates an (n+1)-element indptr that
OOMs at ~17 GB for shapes near INT32_MAX.
Add TestInt64ConversionPreservation (11 tests) to the mainline int64
test file covering all fixed sites.
Add four tests to TestInt64ConversionPreservation covering important correctness properties of the object.__new__ bypass from the prior commit:
- has_canonical_format / has_sorted_indices compute correctly via the lazy kernel on object.__new__ results (no _has_canonical_format cached)
- copy() on an object.__new__ CSC preserves int64 index dtype
- tocoo() on _cupy_csr2csc_int64 output preserves int64 row/col dtype
- empty COO → tocsc (Pattern B) → tocsr (Pattern C) chain preserves int64 end to end

Also adds test_csc_tocsr_empty_small_shape_preserves_int64 (the symmetric counterpart to the CSR→CSC empty test, covering _cupy_csc2csr_int64 nnz=0).
…dptr building

CUDA provides atomicAdd(unsigned long long*, ...) but not the signed long long overload, so cupy.add.at raises TypeError for int64 destination arrays. Reinterpreting int64 indptr[1:] as uint64 sidesteps this: two's-complement addition is bit-identical for non-negative values, and indptr counts are bounded by nnz, which is far below 2^63 on any real system. Replaces unique+scatter at 6 sites (coo2csr, coo2csc, _cupy_csr2csc_int64, _cupy_csc2csr_int64, eliminate_zeros, _minor_index_fancy_sorted). The new approach is O(nnz) with GPU atomics vs O(nnz log nnz) for unique; measured 7-25x faster in benchmarks.
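A NumPy illustration of the reinterpret-as-uint64 trick (NumPy's `add.at` accepts int64 targets directly, so this only demonstrates the bit-identical-for-non-negative-values claim, not the CuPy limitation):

```python
import numpy as np

# Build a CSR indptr from per-element row indices by scattering +1
# through a uint64 reinterpretation of indptr[1:]. For non-negative
# counts, unsigned two's-complement addition is bit-identical to signed.
rows = np.array([0, 0, 2, 2, 2], dtype=np.int64)
n_rows = 3
indptr = np.zeros(n_rows + 1, dtype=np.int64)
np.add.at(indptr[1:].view(np.uint64), rows, 1)   # per-row counts
indptr = np.cumsum(indptr)                       # counts -> CSR offsets
# indptr is now [0, 2, 2, 5], still int64.
```

The same count-then-cumsum shape is what the later `_build_indptr` helper consolidates.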
…sr/csc conversions

- spgemm_int64 threshold: (99000, None) → (13000, None). CUDA 13.0 added native int64 SpGEMM (CUSPARSE-2365); the cuSPARSE version encoding is major*1000 + minor*100 + patch (confirmed: 12.5.10 → 12510), so CUDA 13.0 → 13000. Mixed int32/int64 inputs are upcast to uniform int64 before the cuSPARSE call; the pure-CuPy fallback only applies on CUDA < 13.0.
- Set has_sorted_indices = True on all four object.__new__ bypass paths in _cupy_csr2csc_int64 and _cupy_csc2csr_int64 (nnz=0 and main paths). The lexsort-based algorithm provably produces sorted indices; marking this avoids a redundant kernel scan on every .has_sorted_indices access.
- Fix the spgeam() docstring: remove the incorrect "CUDA >= 12.0" claim; SpGEAM is absent from all public releases through CUDA 13.2.
- Add a rule in the spgemm() comment: always upcast mixed int32/int64 inputs to uniform int64 and use cuSPARSE when the native API is available; never fall back to pure CuPy merely because inputs have mixed index types.
- Add .real, .imag properties (_data.py) and a trace() method (_base.py) to all sparse matrices, matching scipy.sparse
- Trim verbose "TEMPORARY WORKAROUND" comments throughout cusparse.py to concise one-liners; CuPy devs don't need the preamble
- Fix pre-existing "entories" typos in the eliminate_zeros (_csr.py) and count_nonzero (_data.py) docstrings
- Remove an unnecessary int(mask.sum()) D2H sync in eliminate_zeros; use len(new_data) (host metadata) instead
- Propagate has_sorted_indices in _with_indices_dtype (an int32→int64 cast preserves sort order)
- Fix E501 line-length violations in all changed files
- Add int64 index dtype documentation to scipy_sparse.rst
- Fix test line-length violations in test_sparse_int64_indices.py
…, unary ops, linalg guards
…r2csc/csc2csr

- Add _indptr_to_coo(): replaces 12 inline searchsorted(indptr[1:], arange(nnz), side='right') call sites across 4 files
- Add _build_indptr(): replaces 6 inline add.at(view(uint64)) + cumsum call sites across 3 files; handles both int32 and int64
- Merge _cupy_csr2csc_int64 and _cupy_csc2csr_int64 into a single _cupy_transpose_compressed_int64(x, output_cls, out_dim)
…sites

Add _compressed_sparse_matrix._from_parts() and coo_matrix._from_parts() for constructing from pre-validated arrays without the check_contents downcast. Replaces 16 ad-hoc bypass patterns (object.__new__ + manual attribute assignment, shape-only constructor + overwrite) with a single clean internal API. Also converts _with_data() and _with_indices_dtype() to use _from_parts, eliminating every remaining bypass site.
…e int64 guards

- Replace 6 hardcoded name_expression lists (up to 8 entries each) with _idx_types and _arg_reduction_types class attributes + comprehensions
- Add a _check_int32_indices() helper; replace 3 near-identical ValueError blocks in csrsm2, csrilu02, and spsolve
Add consistent TODO(cuSPARSE), TODO(cupy), TODO(hipSPARSE) tags to all int64 fallback sites, replacing verbose TEMPORARY WORKAROUND text and bare "int32-only" comments. grep -r 'TODO(cuSPARSE)' finds all 22 cuSPARSE int32-only workarounds. grep -r 'TODO(cupy)' finds all 4 CuPy core limitation workarounds.
- Fix the coosort int64 path: _has_canonical_format = False silently set a nonexistent private attribute on COO (which uses has_canonical_format as a plain attribute, not a property); the flag remained True
- Use the public has_canonical_format setter in spgeam (new code only)
- Document lazy-computed attributes in the _from_parts docstrings
_set_many and _zero_many use arange(nnz) as position markers in a temporary CSR to look up offsets. Two pre-existing bugs:
1. float32 arange: only a 24-bit mantissa, so positions collide for nnz > 16M and writes land on the wrong neighbor. Fix: use float64.
2. int32 offset cast: positions truncate for nnz > INT32_MAX; valid positions wrap negative and get masked as not-found, causing silent fallthrough to duplicate insertion. Fix: use self.indices.dtype (int32 when nnz fits, int64 when not).
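The float32 collision in bug 1 is easy to demonstrate:

```python
import numpy as np

# float32 has a 24-bit significand: consecutive integers above 2**24
# (~16.7M) are not representable, so float32 position markers collide.
p32 = np.float32(2 ** 24)
collided = (p32 + np.float32(1)) == p32      # 2**24 + 1 rounds back to 2**24
p64 = np.float64(2 ** 24)
exact = (p64 + 1.0) != p64                   # float64 represents 2**24 + 1 exactly
```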
…und)

_insert_many uses cupy.add.at to count per-row insertions, but add.at rejects int64 targets (CUDA lacks a signed int64 atomicAdd). Apply the same view(uint64) workaround used by _build_indptr.
- setdiag: use self.indices.dtype instead of hardcoded 'i' (int32) for the auxiliary CSR construction; col_st > INT32_MAX overflowed.
- diagonal: change the kernel param from int32 k to I k so a diagonal offset k > INT32_MAX can be passed without overflow. Discovered because setdiag calls self.diagonal(k=k) internally.
…l matrices

The int32 fallback kernels (used for bool dtype) silently overflow when m*n > INT32_MAX. Add a ValueError guard with a suggestion to convert to float (which routes through the denseToSparse Generic API). Full int64 kernel support is deferred to a separate PR.
Use get_index_dtype(maxval=max(shape)) for offsets in the DIA constructor and for indptr/indices in tocsc. Template the tocsc kernel with I (int32 or int64) instead of hardcoded int32.
Early return when sort_by='r' and has_canonical_format is True. Saves ~5ms per tocsr() call on canonical 1M-element COO (~47% speedup). Benefits both int32 and int64 paths. csrsort/cscsort were investigated but not added: has_sorted_indices triggers an expensive GPU kernel scan (43ms for small matrices), more costly than the sort itself.
Template int32 params to I in the dia_nnz ReductionKernel. The DIA constructor now produces int64 offsets for large shapes, but getnnz still expected int32.
random_state.choice(m*n, k) internally builds an O(m*n) permutation array, making sparse.random() unusable for large shapes (e.g. 10^6 x 10^6 needs 8 TB). Add a rejection-sampling fallback that uses O(k) memory: generate random flat indices via rand()*mn, deduplicate with unique, and repeat for the shortfall. It converges in 1-2 iterations for typical sparse densities.

The original choice-based path is kept for moderate m*n (faster single kernel), with an OOM try/except that falls back to rejection sampling. Index dtype is selected based on whether m*n exceeds INT32_MAX. Addresses the concern raised by @samaid in cupy#3513 about the Knuth permutation algorithm being unsuitable for large sparse matrices.
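A NumPy sketch of the rejection sampler (the function name and oversampling factor are illustrative, not the PR's code):

```python
import numpy as np

def sample_flat_indices(mn, k, rng):
    """Draw k distinct flat indices from [0, mn) using O(k) memory."""
    chosen = np.empty(0, dtype=np.int64)
    while chosen.size < k:
        need = k - chosen.size
        # Oversample slightly to absorb duplicates within/across batches.
        draw = (rng.random(int(need * 1.1) + 16) * mn).astype(np.int64)
        chosen = np.unique(np.concatenate([chosen, draw]))
    return chosen[:k]

rng = np.random.default_rng(0)
# A shape whose flat size would need terabytes as a permutation array.
idx = sample_flat_indices(10 ** 12, 1000, rng)
```

Because collisions are rare when k is far smaller than m*n, the loop almost always terminates after one or two draws; the result here is a sorted sample, which is acceptable for sparse-pattern generation.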
Rewrite 3 tests that allocated ~17 GB GPU memory each, preventing parallel test execution (pytest -n). The int64 kernel code paths are still exercised:
- test_sum_axis0_large_cols: use a forced-int64 small matrix instead of a _LARGE-shaped matrix whose dense result is 17 GB
- test_dia_tocsr/tocsc_large_shape: swap the shape from (2, _LARGE+1) to (_LARGE+1, 2) so the CSC indptr has 3 entries instead of 17 GB; max(shape) > INT32_MAX still triggers the int64 kernel path
Author: This work began with samaid@3014b99 from @samaid, which is the "spiritual" predecessor to this PR and helped inform it.
Add has_canonical_format / has_sorted_indices kwargs to _from_parts on CSR/CSC/COO so sort and canonical flags can be set at construction time instead of as post-hoc mutations. This eliminates unnecessary device synchronizations in three paths: - _with_data (used by copy, abs, neg, astype, real, imag): previously discarded cached flags, forcing a GPU kernel + D2H sync to recompute them on the next operation that checks canonical/sorted state. - _set_many / _zero_many (used by __setitem__): previously constructed a throwaway CSR via the public constructor, which runs get_index_dtype(check_contents=True) — two D2H syncs to validate indices that are already known-valid. - COO _from_parts: now defaults has_canonical_format=False, preventing AttributeError when copy/abs/neg is called on a _from_parts-created COO without explicit flag setting.
Member: /test mini
CSR/CSC/COO transpose() and COO dot(scalar) used the public constructor, which runs check_contents=True and downcasts small int64 values to int32. Switch to _from_parts to preserve the caller's chosen dtype. Also fix _min_or_max_axis to use _from_parts with explicit idx_dtype (avoids float64 zeros as COO indices and preserves the input dtype on the result), include b.indices in _compressed_sparse_stack's get_index_dtype call, and add a short-circuit to _with_indices_dtype when the dtype already matches.
…gemm

Replace public constructor calls with _from_parts in _major_slice, _minor_slice, _major_index_fancy, binopt_csr, COO reshape, and the pure-CuPy SpGEMM fallback. The public constructor runs check_contents=True, which downcasts small int64 values to int32, silently losing the caller's chosen index dtype. Extract an _empty_like() helper on _compressed_sparse_matrix for the recurring empty-result pattern (3 call sites). Handle the copy parameter explicitly in _major_slice since _from_parts does not copy (this matters for getrow, which passes copy=True). Adds 7 regression tests to the mainline test file and 26 to sparse_tests, all verified to fail without these changes.
Add a fast path for scalar comparisons (!=, >, <, ==, etc.) that avoids the O(m*n) expansion in binopt_csr when op(0, scalar) is False. In that case only stored entries can produce True, so the result is computed in O(nnz) by applying the op to self.data and filtering. This fixes OOM for large-shape matrices (e.g., m != 0 on a 2x(2^31+1) CSR previously tried to allocate 34 GB).

Also fix multiply_by_scalar, multiply_by_dense, and multiply_by_csr to use _from_parts instead of the public constructor, preserving int64 index dtype and avoiding check_contents D2H syncs. This fixes the asymmetry where m * 2.0 preserved int64 but m / 2.0 did not.

Fix _minor_index_fancy_sorted to propagate the source's index dtype to the output instead of recomputing it from the output shape. Add dtype mismatch validation to _from_parts (rejects mismatched indices/indptr dtypes that would cause downstream cuSPARSE failures).
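A SciPy/NumPy sketch of the fast-path idea (the helper name is illustrative; the real code operates on CuPy CSR internals):

```python
import operator

import numpy as np
from scipy import sparse

def compare_scalar(m, op, scalar):
    """O(nnz) scalar comparison, valid only when op(0, scalar) is False."""
    if op(m.dtype.type(0), scalar):
        # Implicit zeros would all become True: needs the dense-ish path.
        raise ValueError('fast path requires op(0, scalar) == False')
    coo = m.tocoo()
    keep = op(coo.data, scalar)          # test stored entries only
    return sparse.coo_matrix(
        (np.ones(int(keep.sum()), dtype=bool), (coo.row[keep], coo.col[keep])),
        shape=m.shape).tocsr()

m = sparse.csr_matrix(np.array([[0.0, 3.0], [5.0, 0.0]]))
res = compare_scalar(m, operator.gt, 2.0)
```

Since implicit zeros can never contribute a True entry under the precondition, the result's sparsity pattern is a subset of the input's and no O(m*n) expansion is needed.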
The previous commit inadvertently reverted the _empty_like helper, _major_slice, _minor_slice, and _major_index_fancy _from_parts fixes. Restore them and also fix two empty-case gaps in _minor_index_fancy and _minor_index_fancy_sorted that were missed previously. Add _from_parts dtype mismatch validation (rejects mismatched indices/indptr dtypes) and propagate source index dtype through _minor_index_fancy_sorted output. Adds regression tests for all changes.
Member: /test mini
Enable native cuSPARSE SpGEMM for int64 on CUDA 13.0+. The cuSPARSE version (12709) is the same for CUDA 12.7 and 13.0, so dispatch checks the CUDA runtime version instead. Result matrices use _from_parts to preserve int64 index dtype.

Template dense2csr kernels for int64 (B3): _d2c_atomic_add helper for int64 atomicAdd; remove the ValueError guard for large shapes. Add a _cumsum_int64 workaround for cupy.cumsum silently failing on arrays >= 2^32 elements (CUB scan wrapper int overflow; fix in PR cupy#9810); used in _build_indptr and dense2csr.

Fix COO transpose falsely propagating has_canonical_format (caused 54 CI failures in LSMR/SVDs). Fix a bmat crash on mixed sparse+dense blocks. Fix _maximum_minimum, setdiag, and _major_slice losing int64 dtype or flags. Move int64 routing before check_availability in CSR/CSC conversion functions. Convert 37 bare asserts to ValueError/TypeError in cusparse.py. Add an early-switch heuristic in random() based on allocation size. Add 17 regression tests.
_get_csr_submatrix_major_axis always returns a new indptr array (via subtraction), so the explicit copy in _major_slice was redundant. Replace em-dashes and superscript characters with ASCII equivalents in comments for consistency with the rest of the codebase.
Author: /test mini
Update test_cusparse.py to expect ValueError instead of AssertionError for non-F-contiguous spmm inputs (assert was converted to ValueError in commit 91dd3f8). Add GPU memory checks to dense2csr int64 tests that allocate ~19 GB (2 GB dense bool + 17 GB info array). These OOM under memory pressure when run after other tests in the full sparse suite.
Member: /test mini
Closes #3513.
Sparse matrices (CSR, CSC, COO) now support int64 indices/indptr/row/col, matching scipy.sparse semantics. Index dtype is chosen automatically via get_index_dtype(check_contents=True): int32 when values fit, int64 when they don't. All existing int32 behavior is preserved.

Matrices with dimensions or nnz exceeding ~2.1 billion (INT32_MAX) previously failed silently or raised unhelpful errors. This is a prerequisite for large-scale sparse workloads (e.g., single-cell genomics via rapids-singlecell, large graph analytics). There is already a companion PR that uses this branch: scverse/rapids-singlecell#620
Note that this PR was made with assistance of Claude Code (models: Sonnet 4.6, Opus 4.6). I have been heavily involved every step of the way, which I hope is clear from the level of hardening already present. Nevertheless, as is typical with AI-assisted work, I am part author and part reviewer, and I am still reviewing these changes.
How
- …SpMatDescr: no changes needed for these paths
- Hardcoded int* / int32 kernel parameters now use the template parameter I
- _from_parts() classmethods on CSR/CSC/COO for internal construction without the check_contents downcast, replacing ad-hoc object.__new__ patterns
- New helpers (_indptr_to_coo, _build_indptr) consolidate repeated patterns

Also fixed along the way: a float32 precision bug in __setitem__ for nnz > 16M, several int32 overflow issues in binopt/setdiag/DIA kernels that affected large matrices regardless of index dtype, and random()/rand() O(m*n) memory usage for large shapes (noted by @samaid in #3513).

What remains

- The dense2csr bool-dtype fallback is still int32 (guarded with a ValueError; medium-effort fix)
- cupy.repeat upstream fixes would simplify several workarounds (but the workarounds are fast, and not too bad). I plan to create a PR that supports cupy.repeat with CuPy arrays as the repeats argument (I am nearly finished with this)

Future work is to add support for sparse arrays (not matrix classes).
scipy.sparse semantics for sparse arrays is to simply use the dtype of the indices that the user provided (and then, if everything isn't int32, upcast to int64 as appropriate). Supporting mixed dtypes for indices/indptr/row/col for sparse matrix classes is something we could consider, but I would prefer to do that in another PR. Also note that sparse matrix classes are being deprecated in scipy and may be removed in 2027.

This is a large PR. It's also my first major contribution to CuPy. I can break this into multiple PRs if it would help. I value feedback.
Here is a review by Claude Code that explains some of the implementation details: line_by_line_full.md