-
-
Notifications
You must be signed in to change notification settings - Fork 26.5k
MAINT | ENH Change default value of subsample + allow for all strategies in KBinsDiscretizer #26424
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MAINT | ENH Change default value of subsample + allow for all strategies in KBinsDiscretizer #26424
Conversation
|
ping @glemaitre |
thomasjpfan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how the code is simplified by supporting all strategies. Although, enabling subsampling for all strategies is backward incompatible. The safer path is to go through another deprecation cycle but for the other strategies.
| .. versionadded:: 0.24 | ||
| subsample : int or None (default='warn') | ||
| subsample : int or None, default=200_000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that subsample is enabled for all strategies, we need to adjust the docstring to not focus on quantile.
It's true that Alternatively, what do you think about setting the default to "auto" which would be 200000 for "quantile" and |
I was thinking of leaving it as "warn" which sets I am leaning toward (+0.5) going through another deprecation cycle for the other strategies. |
I chose this path. I was not very happy to introduce an unnecessary breaking change. |
glemaitre
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM on my side.
Co-authored-by: Guillaume Lemaitre <[email protected]>
thomasjpfan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the update! LGTM
…BinsDiscretizer (scikit-learn#26424) Co-authored-by: Guillaume Lemaitre <[email protected]>
The value is meant to change to 20000 for the 1.3 release.
When implementing the change I faced an error because we used to not support subsampling for other strategies than "quantile". This is an issue because we are setting the default to use subsampling.
Looking at this, I don't see any reason not to support subsampling for the "kmeans" and "uniform" strategies, especially since we set the default value very high. Note that there was no test for the behavior of subsampling, so I added a simple one to check that the bin edges a somewhat close the the ones obtained without subsampling. I propose to now support subsampling for all strategies.