Add short cut for subsetting along the minor axis#8468
Conversation
| idx_is_sorted = cupy.all(idx[:-1] <= idx[1:]) | ||
| if not self.has_sorted_indices or idx_is_sorted: |
There was a problem hiding this comment.
idx_is_sorted is a cupy.ndarray, so it requires DtoH synchronization. Aligning with CuPy's policy, we provide features as possible while minimizing the amount of DtoH synchronization.
How about sorting idx without checking it is already sorted?
There was a problem hiding this comment.
I feel like sorting the index would create problems for the user. Because some people relay on sub-setting to sort there array.
There was a problem hiding this comment.
I'm sorry, but I don't understand why copying would help with the issue. I added this:
idx_is_sorted = cupy.all(idx[:-1] <= idx[1:])
if not self.has_sorted_indices or idx_is_sorted:
because I want the user to get sorted X.indices if X.has_sorted_indices. If we don't care about that, we could leave this out and just run with the shortcut in every case. However, calling X.sort_indices() is more expensive than the old solution (which still has one DtoH that I save).
There was a problem hiding this comment.
Hmm, I would prefer not to perform additional computation on the GPU to decide whether to use a shortcut path or not. Could you modify to allow shortcuts only to the extent that they can be branched by simple calculations on the host side?
There was a problem hiding this comment.
How about copying
idx?I'm sorry, but I don't understand why copying would help with the issue.
I think this was in response to your earlier comment
I feel like sorting the index would create problems for the user.
My take is that @asi1024 suggested to copy idx and sort the copy, instead of sorting idx in-place (which could cause problems as you pointed out). After sorting, you don't need to compute idx_is_sorted and copy the boolean back to host, just proceed to use the sorted idx.
There was a problem hiding this comment.
Let me clarify the intent and functionality of the current implementation because I feel like we are misaligned:
-
Unsorted Indices and Canonical Format:
My current implementation explicitly supports subsetting with an unsorted index array. In other words, it correctly returns columns exactly in the order specified by the user, even if the provided indices are unsorted. However, in such cases, the resulting CSX matrix will no longer be in canonical format (sorted column indices for CSR). -
Why Sorting Doesn't Always Apply:
Sorting the index will still not help with the creation of a canonical CSX matrix. Even an unsorted index results in the correct order of columns in a CSR matrix just not in canonical format. I think if a user subsets a Canonical matrix they want a canonical matrix out. -
Reason for the Existing Check:
The only reason the current check (idx_is_sorted) exists is to preserve performance in cases where users subset from a canonical CSR matrix using an unsorted index. The old implementation is faster than using my implementation and than calling sort_indices() afterwards.
There was a problem hiding this comment.
I've been thinking about this, and what Severin said makes sense to me. @asi1024 could you share any implementation opportunity that you see in achieving what you suggested (simple shortcut based only on host logics)?
|
Benchmarks on an RTX 3090: X.shape = (200,000, 27,998) Additionally, the shortcut is significantly more memory efficient. With 400,000 rows, the old implementation results in a memory error, while the new version completes without an issue. |
|
@leofang as mentioned, this PR is important for the single-cell community as the old implementation is a significant bottleneck and responsible for many oom errors due to CSR to CSC conversion. |
| idx_is_sorted = cupy.all(idx[:-1] <= idx[1:]) | ||
| if not self.has_sorted_indices or idx_is_sorted: |
There was a problem hiding this comment.
I've been thinking about this, and what Severin said makes sense to me. @asi1024 could you share any implementation opportunity that you see in achieving what you suggested (simple shortcut based only on host logics)?
|
After extensive testing and reviewing SciPy’s implementation, I found that my earlier version failed to handle duplicate indices such as [5, 5, 3, 2, 1, 1]. SciPy simply slices the axis without enforcing canonical CSR order, so unsorted input indices produce unsorted output rows. I have now re-implemented SciPy’s approach for CuPy with custom kernels that correctly support duplicates, eliminate the costly post-sort, run significantly faster, and use less memory. Let me know whether you’d prefer me to update the existing pull request with this implementation or create a new one. |
|
If it is still in-scope for the original intent, feel free to just reuse the same PR. TBH either way is fine. |
|
/test mini |
|
@asi1024 would it be possible for you to take another look? All tests have passed. Thanks! 🙂 |
| _scalar.get_typename(self.data.dtype), | ||
| ) | ||
| fillB = self._fill_B.get_function(ker_name) | ||
| threads = 32 |
There was a problem hiding this comment.
Different thread counts are specified for calc_Bp_minor and fillB. Is this intentional?
There was a problem hiding this comment.
Yes that is intentional based on how _fill_B& _fill_B_complex use the shared row to update the nnz per row.
This address #8452
I added a shortcut for sorted
idxfor the subsetting or when theX.indicesare not sorted to begin with.