-
-
Notifications
You must be signed in to change notification settings - Fork 26.5k
[MRG] Add a stratify option to utils.resample #13549
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| return bool(isinstance(x, numbers.Real) and np.isnan(x)) | ||
|
|
||
|
|
||
| def _approximate_mode(class_counts, n_draws, rng): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether this deserves a clearer name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
draw_from_class_counts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a random draw, since we actually want the mode. Sorry... not coming up w good names here either
sklearn/utils/__init__.py
Outdated
| random_state.shuffle(indices) | ||
| indices = indices[:max_n_samples] | ||
| # Code adapted from StratifiedShuffleSplit() | ||
| y = stratify |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether there is a better way to share the code/logic with StratifiedShuffleSplit. Am I right to think the difficulty stems from the use of permutation + slice in ShuffleSplit, which we don't want here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, I removed the permutation + slice logic because it's simpler to use np.random.choice, but could have kept it.
The real need for this is that we want to avoid the checks for train / test set sizes that are in StratifiedShuffleSplit()
jnothman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay with this. Please add to what's new.
sklearn/utils/tests/test_utils.py
Outdated
| n_samples = 100 | ||
| X = rng.normal(size=(n_samples, 1)) | ||
| y = rng.randint(0, 2, size=(n_samples, 2)) | ||
| resample(X, y, n_samples=50, random_state=rng, stratify=y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably check the shape of y.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if we can have a better test
|
@NicolasHug Thanks!!! Going forward for SuccessiveHalving :) |
…)" This reverts commit be9bbc6.
…)" This reverts commit be9bbc6.
Reference Issues/PRs
Closes #13507
What does this implement/fix? Explain your changes.
This PR adds a
stratifyoption toutils.resample. The issue withtrain_test_splitis that it will (rightfully) complain if train or testsets are empty.The code is based on that of
StratifiedShuffleSplit.Any other comments?
I personally need this to properly implement SuccessiveHalving #12538