
Implement cupy.linalg.eig / cupy.linalg.eigvals clone of PR #8854 #8980

Merged
asi1024 merged 37 commits into cupy:main from mfoerste4:eig_eigvals_support
May 29, 2025

Conversation

@mfoerste4
Contributor

This is PR #8854 with an updated main branch
CC @leofang

@leofang leofang added the takeover Pull-requests taken over from other contributor label Feb 24, 2025
@leofang
Member

leofang commented Feb 24, 2025

/test mini

@leofang leofang added the cat:feature New features/APIs label Feb 24, 2025
@leofang
Member

leofang commented Feb 25, 2025

@mfoerste4 found that the OOM issue is due to unguarded calls to the new cuSOLVER routines, which are only available in recent CUDA 12.x (I forget which one; it's in the release notes and the linked issue). The evidence is that the CUDA 12.8 CIs passed. We need to add a version guard like what we did in the past for other cuSOLVER routines.

@asi1024 asi1024 self-assigned this Feb 25, 2025
@leofang
Member

leofang commented Feb 25, 2025

/test mini

@asi1024
Member

asi1024 commented Feb 25, 2025

@mfoerste4 Thank you for taking over the pull request! Considering the _geev function is called repeatedly by the eig function, I think some redundant calculations can be eliminated.

  • My understanding is that work_device_size and work_host_size return the same value for all _geev calls. If that is correct, we could allocate work_device and work_host in advance and reuse them.
  • We can eliminate the copy when concatenating the return values of _geev with cupy.stack. Could you allocate the concatenated memory in advance?
  • Please avoid repeatedly calling all(w.imag == 0.0). It should be checked against the combined result. Also, instead of using the Python primitive all, rewrite it to use the cupy.ndarray method, like (w.imag == 0.0).all().
  • Could you convert the input array to Fortran format in advance before calling _geev?
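The third point above is worth a quick illustration. A minimal sketch, using NumPy as a stand-in for CuPy since the relevant ndarray API is the same: Python's built-in `all` iterates the array element by element (which on a device array would force one device-to-host transfer per element), while the `.all()` method reduces on the device and transfers a single boolean.

```python
import numpy as np  # stand-in for cupy; the ndarray API is identical here

w = np.array([1.0 + 0.0j, 2.0 + 0.0j, 3.0 + 1.0j])

# Python's built-in all() would iterate element by element; on CuPy that
# means one DtoH transfer per element. The ndarray method reduces
# on-device and transfers a single boolean:
is_real = bool((w.imag == 0.0).all())
```

With CuPy, `cupy.ndarray` exposes the same `.all()` reduction, so the pattern carries over unchanged.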

@leofang
Member

leofang commented Feb 26, 2025

We need to add a version guard like what we did in the past for other cuSOLVER routines.

Looks like it's fixed now! (The remaining CI failures are irrelevant.)

I think some redundant calculations can be eliminated.

@mfoerste4 could you kindly take a look at Akifumi-san's comments?

@EarlMilktea
Member

EarlMilktea commented Feb 27, 2025

@mfoerste4 @leofang Sorry for making you wait so long (I've been very busy recently). Please keep me up to date!

@mfoerste4
Contributor Author

mfoerste4 commented Feb 27, 2025

@asi1024, thanks for your response; comments inline.

@mfoerste4 Thank you for taking over the pull request! Considering the _geev function is called repeatedly by the eig function, I think some redundant calculations can be eliminated.

* My understanding is that `work_device_size` and `work_host_size` return the same value for all `_geev` calls. If that is correct, we could allocate `work_device` and `work_host` in advance and reuse them.

I don't think we can make any assumptions about these in general. For stacked computations you are probably right.

* Please avoid repeatedly calling `all(w.imag == 0.0)`. It should be checked against the combined result. Also, instead of using the Python primitive all, rewrite it to use the `cupy.ndarray` method, like `(w.imag == 0.0).all()`.

Yes, I will add that

* Could you convert the input array to Fortran format in advance before calling `_geev`?

I don't think that is possible. IIUC we need each matrix in Fortran format, but each matrix should remain coalesced in memory. The strides for a (B x M x M) input would then be (M*M, 1, M), which is not Fortran style. So I guess the best we can do is the local transpose for each matrix in _geev (we need a copy anyway, as cuSOLVER might overwrite the input).
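The layout argument can be checked numerically. A minimal sketch, using NumPy as a stand-in for CuPy (the stride semantics are the same): per-matrix Fortran order inside a coalesced batch corresponds to a batched transpose view, which is neither C- nor Fortran-contiguous as a whole.

```python
import numpy as np  # the layout reasoning applies equally to cupy arrays

B, M = 2, 3
a = np.zeros((B, M, M))  # C-contiguous batch: element strides (M*M, M, 1)

# Per-matrix Fortran order within a coalesced batch means element strides
# (M*M, 1, M) -- i.e. a batched transpose view, which as a whole is
# neither C- nor Fortran-contiguous:
at = a.transpose(0, 2, 1)
strides_in_elements = tuple(s // a.itemsize for s in at.strides)
```

This is why a single up-front `cupy.asfortranarray` on the batched input cannot produce the layout each individual cuSOLVER call wants.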

* We can eliminate the copy when concatenating the return values of `_geev` with `cupy.stack`. Could you allocate the concatenated memory in advance?

The stacking for multiple _geev computations seems to be broken in the current PR; the tests only cover 2D inputs. For pre-allocation, the same data-layout issue arises as mentioned above.

As this is my first contact with cupy - if you have a proposal on how to implement the stacking I would be happy to hear it.
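For reference, the pre-allocation idea asi1024 suggested can be sketched as follows, again with NumPy standing in for CuPy and `np.linalg.eigvals` standing in for the per-matrix `_geev` call (both names here are illustrative, not the PR's actual code): allocate the batched output once and let each per-matrix call write into its own slice, instead of collecting results and paying an extra copy in a final `stack`.

```python
import numpy as np  # stand-in for cupy

B, M = 4, 3
rng = np.random.default_rng(0)
a = rng.standard_normal((B, M, M))

# Allocate the batched result once, then fill it slice by slice.
# This avoids the extra copy that np.stack(results) would perform.
w_all = np.empty((B, M), dtype=np.complex128)
for i in range(B):
    w_all[i] = np.linalg.eigvals(a[i])  # each call writes into its slice
```

The same pattern works for the eigenvector output by pre-allocating a (B, M, M) array and assigning into `v_all[i]`.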

@mfoerste4
Contributor Author

@asi1024 , I moved the batching into the inner loop and used the pre-allocated output(s) as much as possible. For real input we still need a local instance for real eigenvectors because of the way the cusolver API is designed.

@kmaehashi
Member

@asi1024 kindly ping

@ev-br
Contributor

ev-br commented May 19, 2025

It's great to see this; I'm looking forward to being able to use it!

For the dtype of eigenvalues: it can be argued that numpy's decision to have value-dependent dtype is a mistake. Other array libraries (notably, jax.numpy, pytorch) always return eigenvalues as a complex array, and it's up to a user to decide if they want to check for zero imaginary parts.
So maybe CuPy could consider doing the same and avoid numpy's "return w.real if w.imag==0 else w" contraption.
For completeness, here's a link to an Array API discussion: data-apis/array-api#935
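NumPy's value-dependent behavior is easy to demonstrate: the dtype of the eigenvalue array depends on the computed values, not just the input dtype.

```python
import numpy as np

# Purely real eigenvalues: NumPy casts the result down to float64.
w_real, _ = np.linalg.eig(np.eye(2))

# A 90-degree rotation matrix has eigenvalues +1j and -1j,
# so NumPy returns complex128 for the same float64 input dtype.
w_cplx, _ = np.linalg.eig(np.array([[0.0, -1.0], [1.0, 0.0]]))
```

An always-complex convention (as in jax.numpy and pytorch) makes the output dtype a function of the input dtype alone, which is also what lets a GPU library return without synchronizing.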

@mfoerste4
Contributor Author

@asi1024 , I don't think there are any open items left. Would you mind triggering the main CI tests?

@asi1024
Member

asi1024 commented May 28, 2025

/test mini

@asi1024
Member

asi1024 commented May 28, 2025

@ev-br @mfoerste4

So maybe CuPy could consider to do the same and avoid numpy's "return w.real if w.imag==0 else w" contraption.

Our team concluded that it is acceptable not to change the output array's dtype based on the output values. This is also reasonable in the sense that it avoids DtoH synchronization.

@kmaehashi
Member

Could you skip the eig/eigvals tests on older CUDA using this?

@pytest.mark.skipif(
    cupy.cuda.runtime.runtimeGetVersion() < 12060, reason='Requires CUDA 12.6+')

Also let's uncomment these two lines to list these APIs on docs:

# linalg.eig
linalg.eigh
# linalg.eigvals

@mergify
Contributor

mergify bot commented May 28, 2025

This pull request is now in conflicts. Could you fix it @mfoerste4? 🙏

@mfoerste4
Contributor Author

Thanks @kmaehashi for the review. I have made the requested changes. Please note that the tests had to be adjusted to account for cupy now always returning complex types.
I also changed the host buffer to pinned memory, as suggested by the cuSOLVER documentation.

@kmaehashi
Member

Thanks @mfoerste4! The change looks good to me, let me kick the CI again.
@asi1024 Do you have any other suggestions on this PR?

/test mini

Member

@asi1024 asi1024 left a comment


Sorry for my late review. LGTM!

@asi1024 asi1024 merged commit 1cbf094 into cupy:main May 29, 2025
62 checks passed
@kmaehashi kmaehashi added this to the v14.0.0a2 milestone May 29, 2025

Labels

cat:feature New features/APIs prio:high takeover Pull-requests taken over from other contributor

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants