Description
Hi all,
I am profiling prediction with random forest classifiers.
A single predict_proba call for one input on an RF calls check_is_fitted on the RF, then calls predict_proba on each of the 100 internal trees, each of which calls check_is_fitted again, and so on. check_is_fitted is simply not adapted to ensembles, yet it is called hundreds of times in the sklearn code as if it were free.
My understanding is that once an RF is fitted (fit() called once), we can assume with little risk that all its internal estimators (trees) are fitted too ;)
In my case, the performance problem is that the current implementation of check_is_fitted (in validation.py) is independent of the estimator type: it calls generic functions on the Python object and checks whether any attribute matching a naming pattern is present (tribal knowledge, or at best a convention). Listing all members can be costly, depending on the type and complexity of the estimator, hence the performance issue.
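A simplified sketch of what such a generic, convention-based check looks like is given here. This is not the actual code from validation.py: generic_is_fitted and ToyRidge are hypothetical names, and the rule "a fitted attribute ends with an underscore" is the assumed naming convention.

```python
# Simplified sketch, not the real validation.py implementation.
def generic_is_fitted(estimator):
    """Scan every instance attribute for the trailing-underscore convention."""
    return any(
        name.endswith("_") and not name.startswith("__")
        for name in vars(estimator)  # walks all attributes of the object
    )


class ToyRidge:
    """Hypothetical estimator used only to exercise the check."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self):
        self.coef_ = [0.0]  # fitted attribute, by convention ends with "_"
        return self


print(generic_is_fitted(ToyRidge()))        # False: no "*_" attribute yet
print(generic_is_fitted(ToyRidge().fit()))  # True: coef_ is present
```

The check knows nothing about the estimator it is given, so its cost grows with the number of attributes it has to scan.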
One possible fix is to make each estimator responsible for knowing whether it is fitted: the RF checks for the member 'estimators_' and nothing more, the tree checks 'tree_', Ridge checks 'coef_', etc. A new method estimator.is_fitted() could be added and implemented for each estimator (not much work for a boolean method).
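A minimal sketch of this proposal, using toy classes rather than the real sklearn estimators: is_fitted() is the hypothetical method suggested above, and each class only looks up the single attribute named in the proposal.

```python
# Sketch of the proposed per-estimator method; toy classes, not sklearn code.
class ToyTree:
    def fit(self):
        self.tree_ = "fitted"
        return self

    def is_fitted(self):
        return hasattr(self, "tree_")  # single attribute lookup


class ToyRidge:
    def fit(self):
        self.coef_ = [0.0]
        return self

    def is_fitted(self):
        return hasattr(self, "coef_")


class ToyForest:
    def __init__(self, n_estimators=100):
        self.n_estimators = n_estimators

    def fit(self):
        self.estimators_ = [ToyTree().fit() for _ in range(self.n_estimators)]
        return self

    def is_fitted(self):
        # Checks 'estimators_' and nothing more: the trees are assumed
        # fitted because fit() is what created them.
        return hasattr(self, "estimators_")


print(ToyForest().is_fitted())        # False
print(ToyForest().fit().is_fitted())  # True
```

Each check is a single hasattr call, and the forest does not need to re-validate its trees.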
Please get rid of this function!
Thanks in advance,
Antoine