Hello sk'ers,
Great work guys in developing scikit-learn!
An under-appreciated yet important issue in machine learning is class imbalance during the training of a classifier [1,2,3]; SVMs in particular are sensitive to it [1]. Class imbalance is not uncommon, especially in my world of neuroimaging datasets, where sample sizes are small and it can be difficult to let go of data to equalize the class sizes during training. I have overcome this by implementing a variant of repeated hold-out (implemented as ShuffleSplit here) that stratifies the class sizes within the training set: every class contributes the same number of training samples, chosen as a fixed, user-specified percentage of the size of the smallest class. This has helped me achieve balanced sensitivity and specificity for my predictive models, and I feel it would be a worthy inclusion in scikit-learn.
This could be achieved with an optional flag, such as `stratify_across_classes_in_training_set=True`, to `StratifiedShuffleSplit`, which would change the behaviour at this line:

`n_i = np.round(n_train * p_i).astype(int)`

Let me know what you think - I would be very happy to contribute this to the ShuffleSplit (and perhaps KFold) implementations of CV.
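To make the proposal concrete, below is a minimal, self-contained sketch of the intended splitting behaviour. It is not scikit-learn API; the function name `balanced_shuffle_split` and the `train_pct` parameter are hypothetical. The sketch simply draws the same number of training samples from every class, computed as a fixed percentage of the smallest class:

```python
import numpy as np

def balanced_shuffle_split(y, n_splits=10, train_pct=0.8, random_state=None):
    # Hypothetical sketch of the proposed behaviour, not scikit-learn API.
    # Every class contributes the same number of training samples,
    # computed as a fixed percentage of the *smallest* class size.
    rng = np.random.RandomState(random_state)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_train_per_class = int(np.round(train_pct * counts.min()))
    for _ in range(n_splits):
        train = []
        for cls in classes:
            idx = np.flatnonzero(y == cls)
            rng.shuffle(idx)
            train.extend(idx[:n_train_per_class])
        train = np.asarray(train)
        # Everything not drawn for training goes to the test set, so the
        # test set keeps the original (imbalanced) class proportions.
        test = np.setdiff1d(np.arange(y.shape[0]), train)
        yield train, test

# Example: 80 vs. 20 samples; each split trains on 16 + 16 samples.
y = np.array([0] * 80 + [1] * 20)
for train, test in balanced_shuffle_split(y, n_splits=3, random_state=0):
    print(np.bincount(y[train]))  # -> [16 16] on every split
```

Note that the balancing here applies only to the training set; the held-out set is left at the natural class proportions, which is what I use to report sensitivity and specificity.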
References:
- Batuwita, R., & Palade, V. (2012). Class Imbalance Learning Methods for Support Vector Machines.
- Visa, S., & Ralescu, A. (2005). Issues in mining imbalanced data sets - a review paper. Proceedings of the Sixteenth Midwest Artificial ….
- Wallace, B. C., Small, K., Brodley, C. E., & Wang, L. (2011). Class Imbalance, Redux. IEEE International Conference on Data Mining (ICDM).
- Raamana, P. R., Weiner, M. W., Wang, L., & Beg, M. F. (2015). Thickness network features for prognostic applications in dementia. Neurobiology of Aging, 36, S91–S102. http://doi.org/10.1016/j.neurobiolaging.2014.05.040