
Conversation

@NicolasHug NicolasHug commented Mar 31, 2019

Reference Issues/PRs

Closes #13507

What does this implement/fix? Explain your changes.

This PR adds a stratify option to utils.resample. The issue with train_test_split is that it will (rightfully) complain if the train or test sets are empty.

The code is based on that of StratifiedShuffleSplit.

Any other comments?

I personally need this to properly implement SuccessiveHalving #12538
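As a quick illustration of the option this PR adds (a hedged sketch, assuming scikit-learn >= 0.21 where the new stratify parameter of sklearn.utils.resample is available):

```python
import numpy as np
from sklearn.utils import resample

X = np.arange(50).reshape(25, 2)
y = np.array([0] * 20 + [1] * 5)  # imbalanced 80/20 labels

# Downsample to 10 rows while preserving the 80/20 class ratio.
# train_test_split could not express this: it would insist on
# producing a non-empty complementary set.
X_sub, y_sub = resample(X, y, n_samples=10, replace=False,
                        random_state=0, stratify=y)

# 10 * (0.8, 0.2) = (8, 2) exactly, so the counts are deterministic.
print(np.bincount(y_sub))  # [8 2]
```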

return bool(isinstance(x, numbers.Real) and np.isnan(x))


def _approximate_mode(class_counts, n_draws, rng):
Member:

I wonder whether this deserves a clearer name

Member Author:

draw_from_class_counts?

Member:

It's not a random draw, since we actually want the mode. Sorry... not coming up with good names here either
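For readers following along, a minimal numpy-only sketch of what this helper computes: an allocation of n_draws across classes whose total is exact, flooring the proportional shares and handing the leftover draws to the largest fractional remainders, with the rng breaking ties. The function name and tie-break details here are illustrative, not the library's exact code.

```python
import numpy as np

def approximate_mode(class_counts, n_draws, rng):
    # Proportional share of each class, then floor it.
    continuous = class_counts / class_counts.sum() * n_draws
    floored = np.floor(continuous)
    need = int(n_draws - floored.sum())
    if need > 0:
        # Give the remaining draws to the largest fractional
        # remainders; shuffle first so ties are broken by the rng.
        remainder = continuous - floored
        order = rng.permutation(len(remainder))
        ranked = order[np.argsort(-remainder[order], kind="stable")]
        floored[ranked[:need]] += 1
    return floored.astype(int)

rng = np.random.RandomState(0)
# Shares are (2.92, 2.92, 1.17); both 0.92 remainders get a draw.
print(approximate_mode(np.array([5, 5, 2]), 7, rng))  # [3 3 1]
```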

random_state.shuffle(indices)
indices = indices[:max_n_samples]
# Code adapted from StratifiedShuffleSplit()
y = stratify
Member:

I wonder whether there is a better way to share the code/logic with StratifiedShuffleSplit. Am I right to think the difficulty stems from the use of permutation + slice in ShuffleSplit, which we don't want here?

Member Author:

Not really. I removed the permutation + slice logic because it's simpler to use np.random.choice, but I could have kept it.

The real need for this is that we want to avoid the checks on train/test set sizes that are in StratifiedShuffleSplit()
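The np.random.choice approach described above can be sketched in isolation (a hypothetical, self-contained version with an inline largest-remainder allocation; the function name and details are mine, not the PR's exact code):

```python
import numpy as np

def stratified_sample_indices(y, n_samples, rng):
    # Allocate draws per class roughly proportionally, then sample
    # without replacement inside each class via rng.choice. No
    # train/test size checks, unlike StratifiedShuffleSplit.
    classes, y_idx = np.unique(y, return_inverse=True)
    counts = np.bincount(y_idx)
    alloc = np.floor(counts / counts.sum() * n_samples).astype(int)
    # Hand any leftover draws (fewer than n_classes) to the largest classes.
    for i in np.argsort(counts)[::-1][: n_samples - alloc.sum()]:
        alloc[i] += 1
    picked = []
    for k, n_k in enumerate(alloc):
        members = np.flatnonzero(y_idx == k)
        picked.append(rng.choice(members, size=n_k, replace=False))
    return np.concatenate(picked)

rng = np.random.RandomState(0)
y = np.array([0] * 20 + [1] * 5)
idx = stratified_sample_indices(y, 10, rng)
print(np.bincount(y[idx]))  # [8 2]
```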

@jnothman jnothman left a comment

I'm okay with this. Please add to what's new.

n_samples = 100
X = rng.normal(size=(n_samples, 1))
y = rng.randint(0, 2, size=(n_samples, 2))
resample(X, y, n_samples=50, random_state=rng, stratify=y)
Member:

We should probably check the shape of y.

Member:

I'm not sure if we can have a better test
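One property that can be asserted deterministically is the per-class allocation when the proportional shares are exact integers, since no rounding or tie-breaking is involved. A hedged sketch of such a test, assuming scikit-learn >= 0.21 where this stratify parameter exists:

```python
import numpy as np
from sklearn.utils import resample

# 50 samples with exact class proportions 0.6 / 0.2 / 0.2.
y = np.array([0] * 30 + [1] * 10 + [2] * 10)
X = np.arange(50).reshape(-1, 1)

_, y_sub = resample(X, y, n_samples=25, replace=False,
                    random_state=42, stratify=y)

# 25 * (0.6, 0.2, 0.2) = (15, 5, 5) exactly, independent of the seed.
assert list(np.bincount(y_sub)) == [15, 5, 5]
```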

@glemaitre glemaitre merged commit 14bdb9d into scikit-learn:master Apr 24, 2019
@glemaitre
@NicolasHug Thanks!!! Going forward for SuccessiveHalving :)

jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Apr 25, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019

Development

Successfully merging this pull request may close these issues.

Stratified subsampler utility?

3 participants