ENH Preserving dtype for np.float32 in RandomProjection #22114
Conversation
In tutorial.rst, GaussianRandomProjection is used in an example that does not preserve the input dtype. This is why the doc tests fail in CI.
thomasjpfan
left a comment
Thank you for the PR, @takoika!
sklearn/random_projection.py
Outdated
-        self.components_ = self._make_random_matrix(self.n_components_, n_features)
+        self.components_ = self._make_random_matrix(
+            self.n_components_, n_features
+        ).astype(X.dtype)
We can prevent a copy for the np.float64 case:
-        ).astype(X.dtype)
+        ).astype(X.dtype, copy=False)
There is still a memory copy for float32, though, which is technically a regression for that case.
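For context, a minimal sketch of the copy semantics of ndarray.astype (not part of this PR): with copy=False the copy is skipped only when the requested dtype already matches, so the float64 path becomes free while the float32 path still allocates a new matrix.

```python
import numpy as np

a = np.zeros((3, 4), dtype=np.float64)

# Same dtype with copy=False: astype returns the original array, no copy is made.
same = a.astype(np.float64, copy=False)
print(same is a)  # True

# Different dtype: the cast is unavoidable, so a new float32 array is allocated.
cast = a.astype(np.float32, copy=False)
print(cast is a, cast.dtype)  # False float32
```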
A better solution would be to use Generator.standard_normal, which supports different output dtypes; that way we would not need any casting. To go down this path, we would first need to support Generator objects, which is a long-term goal: #20669
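A minimal sketch of what that Generator-based path could look like (hypothetical, not what this PR implements), since Generator.standard_normal can draw samples directly in the requested dtype:

```python
import numpy as np

# Draw the random projection matrix directly in float32, no cast needed.
rng = np.random.default_rng(42)
components = rng.standard_normal(size=(20, 1000), dtype=np.float32)
print(components.dtype)  # float32
```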
In the short term, we need to be content with this memory copy regression for float32. Practically, the n_components by n_features matrix is not too big, so I think it should be okay. I would still like to wait and see what others think.
> In the short term, we need to be content with this memory copy regression for float32. Practically, the n_components by n_features matrix is not too big, so I think it should be okay. I would still like to wait and see what others think.
I agree we can probably live with it. Not copying X, with shape (n_samples, n_features), is likely to yield a net memory efficiency improvement when X.dtype == np.float32 in most practical cases.
ogrisel
left a comment
Thanks for the PR. LGTM once @thomasjpfan's suggestion above and mine below are taken into account.
thomasjpfan
left a comment
Thanks for the update!
thomasjpfan
left a comment
Minor comment, otherwise LGTM
Reference Issues/PRs
This is part of #11000.
What does this implement/fix? Explain your changes.
This PR makes SparseRandomProjection and GaussianRandomProjection preserve numpy.float32 passed as input.
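A minimal sketch of the intended behavior after this change (the float32 dtypes shown are what this PR aims for; before the change both would come out as float64):

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

X = np.random.RandomState(0).rand(100, 1000).astype(np.float32)

grp = GaussianRandomProjection(n_components=20, random_state=0)
X_proj = grp.fit_transform(X)

# The fitted projection matrix and the transformed output follow the input dtype.
print(grp.components_.dtype)  # float32
print(X_proj.dtype)           # float32
```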
Any other comments?