Description
Describe the workflow you want to enable
Out-of-Bag (OOB) scoring provides an estimate of how well a RandomForest generalizes without needing to refit the model several times, as k-fold cross validation (CV) demands. Although sklearn provides a mechanism to obtain this estimate, it does not provide a mechanism to integrate it into the existing cross validation workflows. For example, we might want a GridSearchCV that optimises the forest's hyperparameters while fitting only once per parameter set. In theory, this could be implemented using OOB error.
As far as I can tell, the two parameters of interest here are cv and scoring, both inputs to all the CV-related classes, and ultimately to cross_val_score(). scoring can be implemented easily enough using a custom scorer, since this has access to the final estimator and therefore the OOB error. What is problematic here is the cv argument, which requires that we split the dataset, and offers no alternative.
Describe your proposed solution
- We add sklearn.metrics.oob, a scoring function that just returns the OOB error on the trained classifier
- We add sklearn.model_selection.IntegratedCV, which is a cross validator that does not split the data at all, i.e. IntegratedCV().split(X) will return X unchanged

With the combination of these two entities, users will be able to perform OOB-based cross-validation.
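To make the idea concrete, here is a rough sketch of how this can be approximated with today's API, under some assumptions. `oob_scorer` and `no_split_cv` are hypothetical helper names, not part of sklearn. The scorer returns the fitted forest's `oob_score_` attribute (an accuracy for classifiers, so higher is better, rather than an error), and the "split" is a single fold whose train and test indices both cover the whole dataset. The test indices are only there to satisfy the `cv` contract; the scorer ignores them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def oob_scorer(estimator, X, y=None):
    # Hypothetical stand-in for the proposed sklearn.metrics.oob.
    # X and y are ignored: the OOB score was already computed during fit.
    return estimator.oob_score_


def no_split_cv(n_samples):
    # Hypothetical stand-in for the proposed IntegratedCV: one "fold"
    # in which every sample is used for both training and scoring.
    indices = np.arange(n_samples)
    return [(indices, indices)]


X, y = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    # oob_score=True requires bootstrap=True (the default).
    RandomForestClassifier(n_estimators=50, oob_score=True, random_state=0),
    param_grid={"max_depth": [2, 4, None]},
    scoring=oob_scorer,
    cv=no_split_cv(len(X)),
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

This fits each parameter set once, plus one final refit of the best candidate. It works because `cv` already accepts an iterable of (train, test) index arrays, but it is a workaround rather than the dedicated classes proposed above.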
Describe alternatives you've considered, if relevant
It is possible to apply general cross validation schemes, such as k-fold, to a RandomForest. This is an alternative that already exists in sklearn today. However, it forgoes the significant (roughly k times) speedup that could be obtained using OOB error.
Additional context
This question is notably discussed in these threads:
- https://datascience.stackexchange.com/a/66238/83633
- https://datascience.stackexchange.com/questions/37393/scikitlearn-grid-search-random-forest-using-oob-as-metric
- https://datascience.stackexchange.com/questions/76304/gridsearchcv-with-random-forest-classifier