
Stratifying Across Classes in ShuffleSplit #5965

@raamana

Hello sk'ers,

Great work developing scikit-learn!

An underappreciated yet important issue in machine learning is class imbalance during classifier training [1,2,3]. SVMs in particular are sensitive to it [1]. Class imbalance is not uncommon, especially in my world of neuroimaging datasets, where sample sizes are small and it can be difficult to discard data just to equalize class sizes during training. I have worked around this by implementing a variant of repeated hold-out (implemented as ShuffleSplit here) that equalizes the class sizes within the training set: every class contributes the same number of training samples, set to a fixed, user-chosen percentage of the smallest class. This has helped me achieve balanced sensitivity and specificity for my predictive models, and I feel it would be a worthy inclusion in scikit-learn.

This can be achieved by having an optional flag, such as

stratify_across_classes_in_training_set=True

to StratifiedShuffleSplit, which would change how the per-class training sizes are computed at this line:

 n_i = np.round(n_train * p_i).astype(int)
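To make the proposal concrete, here is a minimal standalone sketch of the idea in plain NumPy. It is not scikit-learn code: the function name `balanced_shuffle_split` and its parameters are hypothetical, and sending all remaining samples of each class to the test set is an assumption of this sketch. Instead of proportional sizes as in the line above, every class gets the same training size, a fixed fraction of the smallest class.

```python
import numpy as np


def balanced_shuffle_split(y, n_splits=10, train_fraction=0.8, random_state=None):
    """Yield (train_idx, test_idx) pairs with equal per-class training sizes.

    Hypothetical helper for illustration only; not part of scikit-learn.
    """
    y = np.asarray(y)
    rng = np.random.RandomState(random_state)
    classes, counts = np.unique(y, return_counts=True)

    # Same training size for every class: a fixed fraction of the smallest class.
    n_per_class = int(np.round(train_fraction * counts.min()))
    if n_per_class < 1:
        raise ValueError("train_fraction is too small for the smallest class")

    for _ in range(n_splits):
        train, test = [], []
        for c in classes:
            # Shuffle the indices of this class, then cut at the common size.
            idx = rng.permutation(np.flatnonzero(y == c))
            train.append(idx[:n_per_class])
            # Assumption of this sketch: the rest of the class goes to the test set.
            test.append(idx[n_per_class:])
        yield np.concatenate(train), np.concatenate(test)


# Example: 50 samples of class 0 and 10 of class 1, 80% of the smallest class.
y = np.array([0] * 50 + [1] * 10)
for train_idx, test_idx in balanced_shuffle_split(y, n_splits=3, random_state=0):
    print(np.bincount(y[train_idx]))  # [8 8] on every split
```

With this scheme the training set stays balanced regardless of the original class ratio, at the cost of a test set that is dominated by the larger classes.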

Let me know what you think. I would be very happy to contribute this to the ShuffleSplit (and perhaps KFold) CV implementations.

References:

  1. Batuwita, R., & Palade, V. (2012). Class imbalance learning methods for support vector machines.
  2. Visa, S., & Ralescu, A. (2005). Issues in mining imbalanced data sets-a review paper. Proceedings of the Sixteen Midwest Artificial ….
  3. Wallace, B. C., Small, K., Brodley, C. E., & Wang, L. (2011). Class Imbalance, Redux. Data Mining (ICDM).
  4. Raamana, P. R., Weiner, M. W., Wang, L., & Beg, M. F. (2015). Thickness network features for prognostic applications in dementia. Neurobiology of Aging, 36, S91–S102. http://doi.org/10.1016/j.neurobiolaging.2014.05.040
