Implement estimator tags #6599

@amueller

Currently the common tests hard-code many things, like support for multi-output, requiring positive input, or not allowing specific kinds of predictions.
That's bad design, and it is also a big problem for 3rd-party packages that need to add to the conditions in the common tests (you have to add your estimator to the hard-coded list of estimators with a certain property).
See scikit-learn-contrib/py-earth#96 for an example.

Currently we have only a very poor mechanism for distinguishing classifiers and regressors for similar purposes (and also for deciding the default cross-validation strategy): the _estimator_type attribute. It allows only a single tag (classifier, transformer, regressor).

I think we should deprecate _estimator_type and instead add a more flexible estimator_properties dictionary.
This will allow us to programmatically encode assumptions of the algorithms (like vectorizers taking non-numeric data or NB taking non-negative data) as well as clean up our act with the tests.
People who want to add packages to scikit-learn-contrib, and those working in auto-sklearn-like settings (TPOT), will appreciate that ;)
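To make the proposal concrete, here is a minimal sketch of what an estimator_properties dictionary could look like. Everything in it is an assumption for illustration: the key names, the DEFAULT_PROPERTIES table, the get_properties helper, and the *Sketch classes are hypothetical and are not part of scikit-learn.

```python
# Hypothetical defaults; the key names are illustrative, not an agreed-upon API.
DEFAULT_PROPERTIES = {
    "estimator_type": None,       # e.g. "classifier", "regressor", "transformer"
    "supports_sparse": False,
    "positive_only": False,
    "multioutput": False,
    "input_types": ("ndarray",),
}


class BaseEstimatorSketch:
    """Stand-in base class for this sketch; not sklearn.base.BaseEstimator."""

    estimator_properties = {}


class NaiveBayesSketch(BaseEstimatorSketch):
    # Only the properties that differ from the defaults are declared.
    estimator_properties = {
        "estimator_type": "classifier",
        "supports_sparse": True,
        "positive_only": True,
    }


class TextVectorizerSketch(BaseEstimatorSketch):
    # A vectorizer takes non-numeric input.
    estimator_properties = {
        "estimator_type": "transformer",
        "input_types": ("string",),
    }


def get_properties(estimator):
    """Start from the defaults, then apply overrides from base class to subclass."""
    props = dict(DEFAULT_PROPERTIES)
    for klass in reversed(type(estimator).__mro__):
        props.update(getattr(klass, "estimator_properties", {}))
    return props
```

With something like this, a third-party estimator would declare its own properties on the class instead of being added to a list inside the test suite.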

List of tags I have come up with so far:

  • supports sparse data
  • positive data only
  • supports missing data
  • semi-supervised
  • multi-output only
  • multi-label support
  • multi-output regression
  • multi-label multi-output
  • 1d input only
  • multi-class support (or maybe "no multi-class support"?)
  • needs fitting (or maybe "stateless"? though the GP doesn't need fitting but is not stateless)
  • input dtype / dtype conversions?
  • sparse matrix formats / conversions
  • deterministic?
  • label transformation (not for data)
  • special input format? like for CountVectorizer and DictVectorizer? Or maybe we want a field "supported inputs" that lists ndarray, sparse formats, list, strings, dicts?
  • required parameters?
  • integer input / categorical input supported?
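As a rough illustration of how tags like the ones above could replace the hard-coded lists, the sketch below picks test data and checks from a properties dictionary. The key names (positive_only, supports_sparse, supports_missing, deterministic), the check names, and both helper functions are assumptions made up for this example; they do not correspond to existing scikit-learn functions.

```python
import random


def make_data(props, n_samples=20, n_features=3, seed=0):
    """Generate test input that respects the 'positive data only' tag."""
    rng = random.Random(seed)
    # Estimators tagged positive_only never see negative values.
    low = 0.0 if props.get("positive_only", False) else -1.0
    return [
        [rng.uniform(low, 1.0) for _ in range(n_features)]
        for _ in range(n_samples)
    ]


def select_checks(props):
    """Return the names of the common checks that apply to an estimator."""
    checks = ["check_fit_dense"]
    if props.get("supports_sparse", False):
        checks.append("check_fit_sparse")
    if props.get("supports_missing", False):
        checks.append("check_fit_with_nan")
    # Assumed default: estimators are deterministic unless tagged otherwise.
    if props.get("deterministic", True):
        checks.append("check_fit_idempotent")
    return checks
```

The point of the design is that the test suite only reads the dictionary, so it needs no knowledge of which concrete estimator classes exist.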

cc @GaelVaroquaux @ogrisel @mblondel @jnothman @mfeurer @rhiever
