check_is_fitted and validate_date are a performance bottleneck for ensemble prediction #16653

@antoinecarme

Description

Hi all,

I am profiling prediction with random forest classifiers.

A simple predict_proba call for a single input on an RF calls check_is_fitted on the RF and then calls predict_proba on each of the 100 internal trees, each of which calls check_is_fitted again, and so on. check_is_fitted is simply not suited to ensembles. It is called hundreds of times in the sklearn code as if it were a free function.

My understanding is that once an RF is fitted (fit() called once), we can assume with little risk that all its internal estimators (trees) are fitted too ;)

In my case, the performance problem is that the current implementation of check_is_fitted (in validation.py) is independent of the estimator type: it calls generic functions on the Python object and checks whether an attribute matching a naming pattern is present (a convention at best, tribal knowledge at worst). Listing all members can be costly, depending on the type and complexity of the estimator, hence the performance issue.
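A convention-based check of the kind described above might look roughly like the sketch below. It assumes the pattern is "an instance attribute whose name ends with an underscore"; generic_is_fitted and ToyTree are hypothetical names, and the actual code in validation.py may differ between versions.

```python
def generic_is_fitted(estimator):
    """Estimator-agnostic check (assumed pattern, not the library's code).

    Scans every instance attribute and treats any name with a trailing
    underscore (excluding dunder names) as evidence of a completed fit.
    """
    return any(
        name.endswith("_") and not name.startswith("__")
        for name in vars(estimator)
    )


class ToyTree:
    """Hypothetical estimator used only to exercise the check."""

    def __init__(self, max_depth=None):
        self.max_depth = max_depth

    def fit(self):
        self.tree_ = object()  # learned state, named with a trailing underscore
        return self


unfitted = ToyTree()
fitted = ToyTree().fit()
print(generic_is_fitted(unfitted), generic_is_fitted(fitted))
```

The check has to walk the attribute dictionary because it does not know which attribute a given estimator sets during fit.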

A possible fix is to make each estimator aware of whether it is fitted: the RF checks for the presence of the member 'estimators_' and nothing more, the tree checks 'tree_', Ridge checks 'coef_', etc. A new method estimator.is_fitted() could be added and implemented for each estimator (not much work for a boolean method).
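The proposal could be sketched as follows. The is_fitted() method does not exist in the library as far as this issue is concerned; it is the suggested addition, and ToyForest, ToyTree and ToyRidge are hypothetical classes used only to show the shape of the idea.

```python
class ToyTree:
    """Hypothetical tree: fitted exactly when 'tree_' exists."""

    def fit(self):
        self.tree_ = object()
        return self

    def is_fitted(self):
        return hasattr(self, "tree_")


class ToyRidge:
    """Hypothetical linear model: fitted exactly when 'coef_' exists."""

    def fit(self):
        self.coef_ = [0.0]
        return self

    def is_fitted(self):
        return hasattr(self, "coef_")


class ToyForest:
    """Hypothetical forest: checks 'estimators_' and nothing more."""

    def __init__(self, n_estimators=100):
        self.n_estimators = n_estimators

    def fit(self):
        self.estimators_ = [ToyTree().fit() for _ in range(self.n_estimators)]
        return self

    def is_fitted(self):
        # One attribute lookup; the trees are trusted once the forest is fitted.
        return hasattr(self, "estimators_")


states = [
    ToyForest().is_fitted(),
    ToyForest(n_estimators=3).fit().is_fitted(),
    ToyTree().is_fitted(),
    ToyRidge().fit().is_fitted(),
]
print(states)  # [False, True, False, True]
```

Each check is a single named-attribute lookup, so its cost does not depend on how many attributes the estimator carries.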

Please get rid of this function!

Thanks in advance,
Antoine
