Multiclass and multilabel classifiers should accept arrays with string labels with dtype=object #2374

@ogrisel

As numpy does not have a dtype for variable-length strings, it is very common to use dtype=object for arrays of strings so as not to waste memory: otherwise numpy's default fixed-width string dtype allocates zero-padded memory, with every element sized for the longest string.
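
To make the memory argument concrete, here is a small sketch comparing the two dtypes. The label values are made up for illustration, and the object array only accounts for the per-element references; the Python string objects themselves live outside the array.

```python
import numpy as np

# Illustrative labels (not from the original report): one is much longer
# than the others.
labels = ["cat", "dog", "a much longer label"]

# Default conversion: numpy picks a fixed-width unicode dtype wide enough
# for the longest string, so every slot is padded to that width.
fixed = np.array(labels)

# Object dtype: the array stores one reference per element instead.
as_obj = np.array(labels, dtype=object)

longest = max(len(s) for s in labels)

print(fixed.dtype, fixed.itemsize)    # width is driven by the longest label
print(as_obj.dtype, as_obj.itemsize)  # size of one reference
```

With the fixed-width dtype the short labels pay for the padding of the longest one, which is why dtype=object is the usual choice for label columns of uneven length.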

However, in sklearn 0.14 the sklearn.utils.multiclass.type_of_target function explicitly rejects such arrays:

    if y.ndim > 2 or y.dtype == object:
        return 'unknown'

As a consequence, it is possible to pass y = ['cat', 'dog', 'fish'], but no longer y = np.asarray(['cat', 'dog', 'fish'], dtype=object) (this used to work in 0.13).

Note that np.array(list_of_string, dtype=object) is a necessary idiom (instead of just using list_of_string directly) to do cross-validation or other fancy indexing operations.
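
A minimal sketch of why the array form is needed: a plain Python list cannot be indexed with an array of indices, whereas a numpy array can. The index values below are hypothetical stand-ins for the indices a cross-validation split would produce.

```python
import numpy as np

list_of_string = ["cat", "dog", "fish", "cat"]

# Hypothetical training indices, as a cross-validation iterator might yield.
train_idx = np.array([0, 2, 3])

# A plain list does not support fancy indexing.
try:
    list_of_string[train_idx]
    list_indexing_failed = False
except TypeError:
    list_indexing_failed = True

# The object array does, and the selection keeps dtype=object.
y = np.array(list_of_string, dtype=object)
y_train = y[train_idx]

print(list_indexing_failed)
print(y_train)
```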

I think we should accept y to have dtype=object if and only if all(isinstance(y_i, (six.text_type, six.binary_type)) for y_i in y.ravel()).
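
The proposed condition could look roughly like the sketch below. The helper name is_string_object_array is hypothetical and not part of sklearn; str and bytes are used as the Python 3 equivalents of six.text_type and six.binary_type so that the snippet runs without six.

```python
import numpy as np


def is_string_object_array(y):
    """Hypothetical helper: True if y has dtype=object and holds only strings.

    This is a sketch of the check proposed above, not sklearn's
    implementation.
    """
    y = np.asarray(y)
    if y.dtype != object:
        return False
    # str / bytes stand in for six.text_type / six.binary_type on Python 3.
    return all(isinstance(y_i, (str, bytes)) for y_i in y.ravel())


strings_only = np.array(["cat", "dog", "fish"], dtype=object)
mixed = np.array(["cat", 1, None], dtype=object)

print(is_string_object_array(strings_only))
print(is_string_object_array(mixed))
```

Note that the check visits every element, so its cost grows linearly with the size of y.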

This regression was found in the sklearn_pandas project: scikit-learn-contrib/sklearn-pandas#2

WDYT @arjoly ?
