Skip to content

Add short cut for subsetting along the minor axis#8468

Merged
asi1024 merged 13 commits into
cupy:mainfrom
Intron7:update-minor-slice
Jul 2, 2025
Merged

Add short cut for subsetting along the minor axis#8468
asi1024 merged 13 commits into
cupy:mainfrom
Intron7:update-minor-slice

Conversation

@Intron7
Copy link
Copy Markdown
Contributor

@Intron7 Intron7 commented Aug 8, 2024

This address #8452

I added a shortcut for sorted idx for the subsetting or when the X.indices are not sorted to begin with.

@kmaehashi kmaehashi added cat:performance Performance in terms of speed or memory consumption prio:medium labels Aug 9, 2024
Comment thread cupyx/scipy/sparse/_compressed.py Outdated
Comment on lines +428 to +429
idx_is_sorted = cupy.all(idx[:-1] <= idx[1:])
if not self.has_sorted_indices or idx_is_sorted:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idx_is_sorted is a cupy.ndarray, so it requires DtoH synchronization. Aligning with CuPy's policy, we provide features as possible while minimizing the amount of DtoH synchronization.
How about sorting idx without checking it is already sorted?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like sorting the index would create problems for the user. Because some people relay on sub-setting to sort there array.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about copying idx?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, but I don't understand why copying would help with the issue. I added this:

        idx_is_sorted = cupy.all(idx[:-1] <= idx[1:])
        if not self.has_sorted_indices or idx_is_sorted:

because I want the user to get sorted X.indices if X.has_sorted_indices. If we don't care about that, we could leave this out and just run with the shortcut in every case. However, calling X.sort_indices() is more expensive than the old solution (which still has one DtoH that I save).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I would prefer not to perform additional computation on the GPU to decide whether to use a shortcut path or not. Could you modify to allow shortcuts only to the extent that they can be branched by simple calculations on the host side?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about copying idx?

I'm sorry, but I don't understand why copying would help with the issue.

I think this was in response to your earlier comment

I feel like sorting the index would create problems for the user.

My take is that @asi1024 suggested to copy idx and sort the copy, instead of sorting idx in-place (which could cause problems as you pointed out). After sorting, you don't need to compute idx_is_sorted and copy the boolean back to host, just proceed to use the sorted idx.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me clarify the intent and functionality of the current implementation because I feel like we are misaligned:

  1. Unsorted Indices and Canonical Format:
    My current implementation explicitly supports subsetting with an unsorted index array. In other words, it correctly returns columns exactly in the order specified by the user, even if the provided indices are unsorted. However, in such cases, the resulting CSX matrix will no longer be in canonical format (sorted column indices for CSR).

  2. Why Sorting Doesn't Always Apply:
    Sorting the index will still not help with the creation of a canonical CSX matrix. Even an unsorted index results in the correct order of columns in a CSR matrix just not in canonical format. I think if a user subsets a Canonical matrix they want a canonical matrix out.

  3. Reason for the Existing Check:
    The only reason the current check (idx_is_sorted) exists is to preserve performance in cases where users subset from a canonical CSR matrix using an  unsorted index. The old implementation is faster than using my implementation and than calling sort_indices() afterwards.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about this, and what Severin said makes sense to me. @asi1024 could you share any implementation opportunity that you see in achieving what you suggested (simple shortcut based only on host logics)?

@Intron7
Copy link
Copy Markdown
Contributor Author

Intron7 commented Aug 9, 2024

Benchmarks on an RTX 3090:

X.shape = (200,000, 27,998)
X_new.shape = (200,000, 19,214)
Old implementation = 500 ms
New implementation = 120 ms

Additionally, the shortcut is significantly more memory efficient. With 400,000 rows, the old implementation results in a memory error, while the new version completes without an issue.

@Intron7
Copy link
Copy Markdown
Contributor Author

Intron7 commented Feb 22, 2025

@leofang as mentioned, this PR is important for the single-cell community as the old implementation is a significant bottleneck and responsible for many oom errors due to CSR to CSC conversion.

Comment thread cupyx/scipy/sparse/_index.py Outdated
Comment thread cupyx/scipy/sparse/_compressed.py Outdated
Comment on lines +428 to +429
idx_is_sorted = cupy.all(idx[:-1] <= idx[1:])
if not self.has_sorted_indices or idx_is_sorted:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about this, and what Severin said makes sense to me. @asi1024 could you share any implementation opportunity that you see in achieving what you suggested (simple shortcut based only on host logics)?

@Intron7
Copy link
Copy Markdown
Contributor Author

Intron7 commented Apr 28, 2025

@leofang @asi1024

After extensive testing and reviewing SciPy’s implementation, I found that my earlier version failed to handle duplicate indices such as [5, 5, 3, 2, 1, 1]. SciPy simply slices the axis without enforcing canonical CSR order, so unsorted input indices produce unsorted output rows. I have now re-implemented SciPy’s approach for CuPy with custom kernels that correctly support duplicates, eliminate the costly post-sort, run significantly faster, and use less memory. Let me know whether you’d prefer me to update the existing pull request with this implementation or create a new one.

@leofang
Copy link
Copy Markdown
Member

leofang commented Apr 28, 2025

If it is still in-scope for the original intent, feel free to just reuse the same PR. TBH either way is fine.

@Intron7 Intron7 requested review from asi1024 and leofang April 29, 2025 11:45
@Intron7
Copy link
Copy Markdown
Contributor Author

Intron7 commented Apr 29, 2025

@leofang @asi1024 Thank you for your patience with this one. All cupyx.scipy.sparse tests are now passing on my system, and I’ve tried to keep the kernel formatting as close as possible to cupy.

@leofang
Copy link
Copy Markdown
Member

leofang commented Jun 24, 2025

/test mini

@leofang
Copy link
Copy Markdown
Member

leofang commented Jun 30, 2025

@asi1024 would it be possible for you to take another look? All tests have passed. Thanks! 🙂

_scalar.get_typename(self.data.dtype),
)
fillB = self._fill_B.get_function(ker_name)
threads = 32
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Different thread counts are specified for calc_Bp_minor and fillB. Is this intentional?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that is intentional based on how _fill_B& _fill_B_complex use the shared row to update the nnz per row.

@asi1024 asi1024 added this to the v14.0.0a2 milestone Jul 2, 2025
@asi1024
Copy link
Copy Markdown
Member

asi1024 commented Jul 2, 2025

@Intron7 @leofang LGTM! Thank you for your contributions!

@asi1024 asi1024 merged commit 896f4ca into cupy:main Jul 2, 2025
70 checks passed
@Intron7 Intron7 deleted the update-minor-slice branch July 3, 2025 07:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cat:performance Performance in terms of speed or memory consumption prio:medium

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants