Multiclass and multilabel classifiers should accept arrays with string labels with dtype=object

As numpy does not have a dtype for variable-length strings, it is very common to use `dtype=object` for arrays of strings so as not to waste memory: otherwise numpy's default fixed-width string dtype pads every element with zeros up to the length of the longest string.
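To make the memory argument concrete, here is a minimal sketch comparing the two dtypes. It assumes Python 3 and a reasonably recent numpy (on Python 2 the fixed-width dtype would be a byte string instead); the `labels` values are made up for illustration.

``` python
import numpy as np

# Hypothetical labels: one short string and one long string.
labels = ['cat', 'a' * 100]

# Default conversion: fixed-width unicode dtype sized for the longest string.
fixed = np.array(labels)

# Object dtype: each element is only a reference to a Python string.
boxed = np.array(labels, dtype=object)

# Every element of `fixed` reserves room for 100 characters,
# while every element of `boxed` is a single pointer.
print(fixed.dtype, fixed.dtype.itemsize)
print(boxed.dtype, boxed.dtype.itemsize)
```

The string objects referenced by `boxed` still take memory of their own, but short labels no longer pay for the longest one.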

However, in sklearn 0.14 the `sklearn.utils.multiclass.type_of_target` function explicitly rejects such arrays:

``` python
if y.ndim > 2 or y.dtype == object:
    return 'unknown'
```

As a consequence, it is possible to pass `y = ['cat', 'dog', 'fish']`, but no longer `y = np.asarray(['cat', 'dog', 'fish'], dtype=object)` (which used to work in 0.13).

Note that `np.array(list_of_string, dtype=object)` is a necessary idiom (instead of just using `list_of_string` directly) to do cross-validation or other fancy indexing operations.
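As a small illustration of why the array form is needed, the sketch below indexes both containers with an integer index array. The `test_idx` values are hypothetical and stand in for the indices a cross-validation split would produce.

``` python
import numpy as np

list_of_string = ['cat', 'dog', 'fish', 'dog']
test_idx = np.array([0, 2])  # hypothetical indices from a CV split

# A plain Python list cannot be indexed with an index array.
list_error = None
try:
    list_of_string[test_idx]
except TypeError as exc:
    list_error = exc

# The object array supports fancy indexing and keeps dtype=object.
y = np.array(list_of_string, dtype=object)
y_test = y[test_idx]

print(repr(list_error))
print(y_test)
```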

I think we should accept `y` to have dtype=object if and only if `all(isinstance(y_i, (six.text_type, six.binary_type)) for y_i in y.ravel())`.
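A rough sketch of what such a check could look like follows. The helper name `is_string_object_array` is hypothetical and not part of sklearn, and `str`/`bytes` are used here as Python 3 stand-ins for `six.text_type`/`six.binary_type`.

``` python
import numpy as np


def is_string_object_array(y):
    """Return True if y has dtype=object and holds only str/bytes items.

    Hypothetical helper for illustration; not an existing sklearn function.
    """
    y = np.asarray(y)
    if y.dtype != object:
        return False
    # Same condition as proposed above, with str/bytes in place of six types.
    return all(isinstance(y_i, (str, bytes)) for y_i in y.ravel())


print(is_string_object_array(np.array(['cat', 'dog', 'fish'], dtype=object)))
print(is_string_object_array(np.array(['cat', 1, None], dtype=object)))
```

Note that this scans every element, so the cost is linear in the size of `y`.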

This regression was found in the sklearn_pandas project: https://github.com/paulgb/sklearn-pandas/issues/2

WDYT @arjoly ?


Multiclass and multilabel classifiers should accept arrays with string labels with dtype=object #2374
