MAINT | ENH Change default value of subsample + allow for all strategies in KBinsDiscretizer #26424

jeremiedbb · 2023-05-24T12:51:04Z

The value is meant to change to 200000 for the 1.3 release.
When implementing the change I hit an error because we did not support subsampling for strategies other than "quantile". This is an issue because we are setting the default to use subsampling.

Looking at this, I don't see any reason not to support subsampling for the "kmeans" and "uniform" strategies, especially since we set the default value very high. Note that there was no test for the behavior of subsampling, so I added a simple one to check that the bin edges are reasonably close to the ones obtained without subsampling. I propose to now support subsampling for all strategies.
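
For illustration only, the kind of check described above can be sketched with plain NumPy. This is not the test added in the PR: the data, the number of bins, the tolerance and the variable names are all assumptions, and quantile edges are computed directly with `np.quantile` rather than through `KBinsDiscretizer`.

```python
import numpy as np

# Hypothetical data and settings, chosen only for this sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=1_000_000)
n_bins = 5
subsample = 200_000

# Quantile levels that define the bin edges (0%, 20%, ..., 100%).
levels = np.linspace(0, 1, n_bins + 1)

# Bin edges computed on the full data.
full_edges = np.quantile(X, levels)

# Bin edges computed on a random subsample drawn without replacement.
idx = rng.choice(X.shape[0], size=subsample, replace=False)
sub_edges = np.quantile(X[idx], levels)

# Compare the interior edges only; the outermost edges are the sample
# min and max, which are far more sensitive to subsampling.
close = np.allclose(full_edges[1:-1], sub_edges[1:-1], atol=0.02)
```

Only the interior edges are compared because the minimum and maximum of a subsample necessarily fall inside the range of the full data and can differ noticeably from it.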

jeremiedbb · 2023-05-24T12:52:52Z

ping @glemaitre

thomasjpfan

I like how the code is simplified by supporting all strategies. Although, enabling subsampling for all strategies is backward incompatible. The safer path is to go through another deprecation cycle but for the other strategies.

thomasjpfan · 2023-05-24T14:11:29Z

sklearn/preprocessing/_discretization.py

        .. versionadded:: 0.24

-    subsample : int or None (default='warn')
+    subsample : int or None, default=200_000


Now that subsample is enabled for all strategies, we need to adjust the docstring to not focus on quantile.

doc/whats_new/v1.3.rst

jeremiedbb · 2023-05-24T14:42:59Z

Although, enabling subsampling for all strategies is backward incompatible. The safer path is to go through another deprecation cycle but for the other strategies.

It's true that KBD(strategy="kmeans") used to not do subsampling and would silently start using subsampling in the current state of this PR. Since the new default is quite large (200000), it won't impact many users, and the change would be very small for users with that many samples or more, so I'm okay to just add an entry in the changed models section.

Alternatively, what do you think about setting the default to "auto", which would be 200000 for "quantile" and None for the other strategies? Then we could go through a deprecation cycle to set the default to 200000 for all strategies in 1.5, but that may not be worth the burden.

thomasjpfan · 2023-05-25T18:28:17Z

Alternatively, what do you think about setting the default to "auto" which would be 200000 for "quantile" and None for the other strategies

I was thinking of leaving it as "warn", which sets subsample to 200000 for "quantile" and None for the other strategies. The warning states that the default will change to 200000 for all strategies in v1.5.

I am leaning toward (+0.5) going through another deprecation cycle for the other strategies.

jeremiedbb · 2023-05-26T13:26:17Z

I was thinking of leaving it as "warn", which sets subsample to 200000 for "quantile" and None for the other strategies. The warning states that the default will change to 200000 for all strategies in v1.5.

I chose this path. I was not very happy to introduce an unnecessary breaking change.
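
As a rough sketch of the behavior agreed on here (not the code merged in this PR), the "warn" sentinel could be resolved along these lines. The helper name `resolve_subsample`, the warning text, and the choice to warn only for the non-"quantile" strategies are assumptions made for illustration; the value 200000 is taken from the docstring diff above.

```python
import warnings


def resolve_subsample(subsample, strategy):
    """Hypothetical helper: map the "warn" sentinel to an effective value."""
    if subsample != "warn":
        # The user set the parameter explicitly: use it as is, no warning.
        return subsample
    if strategy == "quantile":
        # Subsampling is already the behavior for "quantile".
        return 200_000
    # Other strategies keep the old behavior (no subsampling) for now,
    # but announce the upcoming change of default.
    warnings.warn(
        "In version 1.5 onwards, subsample=200_000 will be used by default. "
        "Set subsample explicitly to silence this warning.",
        FutureWarning,
    )
    return None
```

With this scheme, `KBinsDiscretizer(strategy="kmeans")` keeps its current results during the deprecation period, and users who pass `subsample` explicitly (an integer or `None`) never see the warning.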

glemaitre

LGTM on my side.

doc/whats_new/v1.3.rst

sklearn/preprocessing/_discretization.py

Co-authored-by: Guillaume Lemaitre <[email protected]>

thomasjpfan

Thank you for the update! LGTM

…BinsDiscretizer (scikit-learn#26424) Co-authored-by: Guillaume Lemaitre <[email protected]>

change default + enable subsampling for all strategies

226394d

jeremiedbb added the API label May 24, 2023

jeremiedbb added this to the 1.3 milestone May 24, 2023

github-actions bot added the module:preprocessing label May 24, 2023

update changelog

e185f05

thomasjpfan reviewed May 24, 2023


deprecation cycle to change default for other strategies

2bd8681

jeremiedbb added 2 commits May 26, 2023 16:41

filter warnings in tests

7c8c2e0

fix doctest

b780c37

glemaitre approved these changes May 31, 2023


Apply suggestions from code review

2548941

Co-authored-by: Guillaume Lemaitre <[email protected]>

thomasjpfan approved these changes May 31, 2023


thomasjpfan merged commit 199cdfd into scikit-learn:main May 31, 2023

REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023

ENH Change default value of subsample + allow for all strategies in K…

20c9327

…BinsDiscretizer (scikit-learn#26424) Co-authored-by: Guillaume Lemaitre <[email protected]>
