Skip to content

Conversation

@jeremiedbb
Copy link
Member

The value is meant to change to 20000 for the 1.3 release.
When implementing the change I faced an error because we used to not support subsampling for other strategies than "quantile". This is an issue because we are setting the default to use subsampling.

Looking at this, I don't see any reason not to support subsampling for the "kmeans" and "uniform" strategies, especially since we set the default value very high. Note that there was no test for the behavior of subsampling, so I added a simple one to check that the bin edges a somewhat close the the ones obtained without subsampling. I propose to now support subsampling for all strategies.

@jeremiedbb jeremiedbb added the API label May 24, 2023
@jeremiedbb jeremiedbb added this to the 1.3 milestone May 24, 2023
@jeremiedbb
Copy link
Member Author

ping @glemaitre

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how the code is simplified by supporting all strategies. Although, enabling subsampling for all strategies is backward incompatible. The safer path is to go through another deprecation cycle but for the other strategies.

.. versionadded:: 0.24
subsample : int or None (default='warn')
subsample : int or None, default=200_000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that subsample is enabled for all strategies, we need to adjust the docstring to not focus on quantile.

@jeremiedbb
Copy link
Member Author

Although, enabling subsampling for all strategies is backward incompatible. The safer path is to go through another deprecation cycle but for the other strategies.

It's true that KBD(strategy="kmeans") used to not do subsampling and would silently start using subsampling in the current state of this PR. Since the new default is quite large (200000), it won't impact many users, and the change would be very small for users with that many samples or more, so I'm okay to just add an entry in the changed models section.

Alternatively, what do you think about setting the default to "auto" which would be 200000 for "quantile" and None for the other strategies ? Then we can deprecate to set the default to 200000 for all strategies in 1.5 but maybe not worth the burden.

@thomasjpfan
Copy link
Member

Alternatively, what do you think about setting the default to "auto" which would be 200000 for "quantile" and None for the other strategies

I was thinking of leaving it as "warn" which sets subsample to 20000 for "quantile" and None for the other strategies. The warning states that the default will change to 20000 for all strategies in v1.5.

I am leaning toward (+0.5) going through another deprecation cycle for the other strategies.

@jeremiedbb
Copy link
Member Author

I was thinking of leaving it as "warn" which sets subsample to 20000 for "quantile" and None for the other strategies. The warning states that the default will change to 20000 for all strategies in v1.5.

I chose this path. I was not very happy to introduce an unnecessary breaking change.

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM on my side.

Co-authored-by: Guillaume Lemaitre <[email protected]>
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the update! LGTM

@thomasjpfan thomasjpfan merged commit 199cdfd into scikit-learn:main May 31, 2023
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants