Hello sk'ers,
Great work guys in developing scikit-learn!
An under-appreciated yet important issue in machine learning is class imbalance during the training of a classifier [1,2,3]; SVMs in particular are sensitive to it [1]. Class imbalance is not uncommon, especially in my world of neuroimaging datasets, where sample sizes are small and it can be difficult to let go of data to equalize the class sizes during training. I have overcome this by implementing a variant of repeated hold-out (implemented as ShuffleSplit here) that stratifies the class sizes within the training set: every class contributes the same number of training samples, chosen as a fixed, user-specified percentage of the size of the smallest class. This has helped me achieve balanced sensitivity and specificity for my predictive models, and I feel it would be a worthy inclusion in scikit-learn.
This could be achieved with an optional flag, such as `stratify_across_classes_in_training_set=True`, to `StratifiedShuffleSplit`, which would change the behaviour at this line:

`n_i = np.round(n_train * p_i).astype(int)`

Let me know what you think - I would be very happy to contribute this to the ShuffleSplit (and perhaps KFold) implementations of CV.
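To make the proposal concrete, below is a minimal, self-contained sketch of the intended splitting behaviour. It is not scikit-learn API; the function name `balanced_shuffle_split` and the `train_pct` parameter are hypothetical. The sketch simply draws the same number of training samples from every class, computed as a fixed percentage of the smallest class:

```python
import numpy as np

def balanced_shuffle_split(y, n_splits=10, train_pct=0.8, random_state=None):
    # Hypothetical sketch of the proposed behaviour, not scikit-learn API.
    # Every class contributes the same number of training samples,
    # computed as a fixed percentage of the *smallest* class size.
    rng = np.random.RandomState(random_state)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_train_per_class = int(np.round(train_pct * counts.min()))
    for _ in range(n_splits):
        train = []
        for cls in classes:
            idx = np.flatnonzero(y == cls)
            rng.shuffle(idx)
            train.extend(idx[:n_train_per_class])
        train = np.asarray(train)
        # Everything not drawn for training goes to the test set, so the
        # test set keeps the original (imbalanced) class proportions.
        test = np.setdiff1d(np.arange(y.shape[0]), train)
        yield train, test

# Example: 80 vs. 20 samples; each split trains on 16 + 16 samples.
y = np.array([0] * 80 + [1] * 20)
for train, test in balanced_shuffle_split(y, n_splits=3, random_state=0):
    print(np.bincount(y[train]))  # -> [16 16] on every split
```

Note that the balancing here applies only to the training set; the held-out set is left at the natural class proportions, which is what I use to report sensitivity and specificity.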
References:
- Batuwita, R., & Palade, V. (2012). Class Imbalance Learning Methods for Support Vector Machines.
- Visa, S., & Ralescu, A. (2005). Issues in mining imbalanced data sets - a review paper. Proceedings of the Sixteenth Midwest Artificial ….
- Wallace, B. C., Small, K., Brodley, C. E., & Wang, L. (2011). Class Imbalance, Redux. IEEE International Conference on Data Mining (ICDM).
- Raamana, P. R., Weiner, M. W., Wang, L., & Beg, M. F. (2015). Thickness network features for prognostic applications in dementia. Neurobiology of Aging, 36, S91–S102. http://doi.org/10.1016/j.neurobiolaging.2014.05.040