
Conversation

@takoika (Contributor) commented Jan 2, 2022

Reference Issues/PRs

This is part of #11000.

What does this implement/fix? Explain your changes.

This PR makes SparseRandomProjection and GaussianRandomProjection preserve a numpy.float32 dtype passed as input.
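As a quick usage sketch of the intended behaviour after this change (illustrative data shapes and parameter values, not taken from the PR itself):

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

X = np.random.RandomState(0).rand(50, 1000).astype(np.float32)
grp = GaussianRandomProjection(n_components=10, random_state=0)
X_new = grp.fit_transform(X)

print(grp.components_.dtype)  # float32 (previously upcast to float64)
print(X_new.dtype)            # float32
```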

Any other comments?

@takoika (Contributor, Author) commented Jan 2, 2022

In tutorial.rst, GaussianRandomProjection is used in an example whose expected output assumes the input dtype is not preserved, which is why the doctests fail in CI.

@thomasjpfan (Member) left a comment

Thank you for the PR, @takoika!

-    self.components_ = self._make_random_matrix(self.n_components_, n_features)
+    self.components_ = self._make_random_matrix(
+        self.n_components_, n_features
+    ).astype(X.dtype)
@thomasjpfan (Member) commented Jan 2, 2022

We can prevent a copy for the np.float64 case:

Suggested change:
-    ).astype(X.dtype)
+    ).astype(X.dtype, copy=False)

Although there is still a memory copy for float32, which is technically a regression for that case.
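To illustrate the copy semantics behind this suggestion, a minimal NumPy sketch (hypothetical array names, not code from the PR):

```python
import numpy as np

a = np.zeros((3, 4), dtype=np.float64)

# Matching dtype: with copy=False, astype returns the input array itself, no allocation.
same = a.astype(np.float64, copy=False)
print(same is a)  # True

# Non-matching dtype: a new float32 array must be allocated even with copy=False.
cast = a.astype(np.float32, copy=False)
print(cast is a)  # False
```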

A better solution would be to use Generator.standard_normal, which supports different dtype outputs. This way we do not need casting. To go down this path, we would need to support Generators, which is a long term goal: #20669
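For reference, a minimal sketch of that Generator-based approach (assuming NumPy's default_rng; shapes are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# standard_normal accepts a dtype argument, so the projection matrix can be
# generated directly in float32, with no cast and no extra copy.
components = rng.standard_normal(size=(10, 1000), dtype=np.float32)
print(components.dtype)  # float32
```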

In the short term, we need to be content with this memory copy regression for float32. Practically, the n_components by n_features matrix is not too big, so I think it should be okay. I would still wait to see what others think.

@ogrisel (Member) commented Jan 3, 2022

> In the short term, we need to be content with this memory copy regression for float32. Practically, the n_components by n_features matrix is not too big, so I think it should be okay. I would still wait to see what others think.

I agree we can probably live with it. Avoiding a copy of X, with shape (n_samples, n_features), is likely to yield a net memory efficiency improvement when X.dtype == np.float32 in most practical cases.
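As a rough worked example of that trade-off (hypothetical sizes, not from the discussion):

```python
# With n_samples=100_000, n_features=1_000, n_components=100 (all illustrative):
x_float32_mb = 100_000 * 1_000 * 4 / 1e6    # 400 MB for X kept as float32, vs 800 MB if upcast to float64
components_copy_mb = 100 * 1_000 * 4 / 1e6  # 0.4 MB extra for the float32 cast of components_
print(x_float32_mb, components_copy_mb)
```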

@ogrisel (Member) left a comment

Thanks for the PR. LGTM once @thomasjpfan's suggestion above and mine below are taken into account.

@takoika requested a review from thomasjpfan on January 4, 2022 at 16:19
@thomasjpfan (Member) left a comment

Thanks for the update!

@takoika requested a review from thomasjpfan on January 6, 2022 at 10:54
@thomasjpfan (Member) left a comment

Minor comment, otherwise LGTM

@thomasjpfan changed the title from "ENH Preserving dtype for np.float32 in SparseRandomProjection and GaussianRandomProjection" to "ENH Preserving dtype for np.float32 in RandomProjection" on Jan 7, 2022
@thomasjpfan merged commit 8b6b519 into scikit-learn:main on Jan 7, 2022