Currently the common tests hard-code many things, like support for multi-output, requiring positive input, or not allowing specific kinds of predictions.
That's bad design, but also a big problem for 3rd-party packages that need to adjust the conditions in the common tests (you have to add your estimator to a hard-coded list of estimators with a certain property).
See scikit-learn-contrib/py-earth#96 for an example.
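To make the problem concrete, here is a minimal sketch of the kind of hard-coding involved; the names and lists below are hypothetical, not the actual test code:

```python
# Hypothetical sketch of how the common tests gate checks on hard-coded
# name lists; the estimator names here are illustrative only.
MULTI_OUTPUT_ESTIMATORS = {"DecisionTreeRegressor", "KNeighborsRegressor"}

def should_check_multi_output(name):
    # A 3rd-party estimator can never appear in this baked-in set, so the
    # corresponding check is silently skipped (or wrongly applied).
    return name in MULTI_OUTPUT_ESTIMATORS

print(should_check_multi_output("DecisionTreeRegressor"))  # True
print(should_check_multi_output("Earth"))  # False: py-earth can't opt in
```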
Currently we have only a very limited mechanism for distinguishing classifiers and regressors for similar purposes (but also for deciding the default cross-validation strategy): the `_estimator_type` attribute. That allows only a single tag (classifier, transformer, regressor).
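For reference, a quick illustration of how that single-tag mechanism is used today (the helper functions are real; the trailing comment paraphrases their logic):

```python
# The mixins set a single string attribute, and helpers such as
# is_classifier / is_regressor simply compare against it.
from sklearn.base import is_classifier, is_regressor
from sklearn.linear_model import LogisticRegression, Ridge

print(is_classifier(LogisticRegression()))  # True
print(is_regressor(Ridge()))                # True
# is_classifier boils down to roughly:
#   getattr(estimator, "_estimator_type", None) == "classifier"
```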
I think we should deprecate `_estimator_type` and instead add a more flexible `estimator_properties` dictionary.
This would allow us to programmatically encode the assumptions of the algorithms (like vectorizers taking non-numeric data or naive Bayes requiring non-negative data), as well as clean up our act with the tests.
People wanting to add to scikit-learn-contrib, and auto-sklearn-like settings (TPOT), will appreciate that ;)
List of tags that I've come up with so far (a rough sketch of how they could fit together follows the list):
- supports sparse data
- positive data only
- supports missing data
- semi-supervised
- multi-output only
- multi-label support
- multi-output regression
- multi-label multi-output
- 1d input only
- multi-class support (or maybe "no multi-class support"?)
- needs fitting (or maybe "stateless"? though the GP doesn't need fitting but is not stateless)
- input dtype / dtype conversions?
- sparse matrix formats / conversions
- deterministic?
- label transformation (not for data)
- special input format? like for CountVectorizer and DictVectorizer? Or maybe we want a field "supported inputs" that lists ndarray, sparse formats, list, strings, dicts?
- required parameters?
- integer input / categorical input supported?
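To make the proposal concrete, here is a rough sketch of what an `estimator_properties` dictionary could look like and how the common tests could query it; every key, value, and name below is hypothetical, not a settled API:

```python
# Hypothetical estimator_properties dictionary on an NB-like estimator;
# the keys and values are illustrative only, not a settled API.
class MultinomialNBLike:
    estimator_properties = {
        "estimator_type": "classifier",
        "supports_sparse": True,
        "positive_data_only": True,    # e.g. naive Bayes on counts
        "supports_missing_data": False,
        "multi_label": False,
        "input_dtypes": ["float64", "int64"],
        "sparse_formats": ["csr"],
        "deterministic": True,
        "stateless": False,            # needs fitting
    }

def should_run_positive_input_check(estimator):
    # The common tests could query the tags instead of a hard-coded name
    # list, so 3rd-party estimators get the right checks automatically.
    return estimator.estimator_properties.get("positive_data_only", False)

print(should_run_positive_input_check(MultinomialNBLike()))  # True
```

A plain dictionary with documented defaults would also let downstream tools like auto-sklearn or TPOT filter estimators by capability without maintaining their own lists.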
cc @GaelVaroquaux @ogrisel @mblondel @jnothman @mfeurer @rhiever