check_array does not raise error when input contains something other than numbers or strings #11401

@jeremiedbb

Description

Now that imputers accept inputs with object dtype, e.g. strings or pandas categoricals, it seems that either check_array should be enhanced or some common tests should be updated.
There is a common test, check_dtype_object, that checks estimators on an input X that contains numbers, with X[0, 0] = {'foo': 'bar'}. When numeric inputs are expected, check_array is called with dtype='numeric' and an error is raised as expected.
However, when it is called with dtype=None or dtype=object, no error is raised. See the code below:

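>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> # Object array whose first entry is a dict, i.e. neither a number nor a string
>>> X = np.array([{'foo': 'bar'}, "a", "b", "c"], dtype=object).reshape(-1, 1)
>>> X
array([[{'foo': 'bar'}],
       ['a'],
       ['b'],
       ['c']], dtype=object)
>>> # Treat the string 'a' as the missing value and fill it with the default constant
>>> imputer = SimpleImputer(strategy='constant', missing_values='a')
>>> imputer.fit_transform(X)
array([[{'foo': 'bar'}],
       ['missing_value'],
       ['b'],
       ['c']], dtype=object)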

No error is raised and the estimator runs without complaint. Don't you think we should raise an error in that case?
This currently passes the test because, when imputing inputs with object dtype, we can't set dtype='numeric' in check_array. I think the error should be raised even with dtype=object or dtype=None.
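
To make the difference concrete, here is a small sketch that calls check_array directly on an array shaped like the one used in check_dtype_object (numbers plus a dict at X[0, 0]). The raises_error helper and the exact array are only illustrative and are not part of scikit-learn; the exception type caught for the numeric case is what I would expect from the float conversion, so both TypeError and ValueError are handled to be safe.

```python
import numpy as np
from sklearn.utils import check_array

# Numbers everywhere except X[0, 0], which holds a dict (as in check_dtype_object).
X = np.array([[{"foo": "bar"}, 1.0], [2.0, 3.0]], dtype=object)


def raises_error(dtype):
    """Illustrative helper: report whether check_array rejects X for this dtype."""
    try:
        check_array(X, dtype=dtype)
    except (TypeError, ValueError):
        return True
    return False


# Compare the three dtype settings discussed in this issue.
results = {
    "numeric": raises_error("numeric"),
    "None": raises_error(None),
    "object": raises_error(object),
}
print(results)
```

With dtype='numeric' the object array has to be converted to float, which is where the dict gets rejected; with dtype=None or dtype=object no conversion is attempted, so nothing inspects the individual entries.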

We could check that in the fit method of the estimators that accept non-numeric inputs, but I think it is the role of check_array to do that. What's your opinion on that?
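
As a rough idea of what such a check could look like, here is a minimal sketch. The name check_object_entries is hypothetical and this is not existing scikit-learn code; it only assumes that "valid" means each entry is a number or a string. Missing-value markers such as None would need to be allowed explicitly, which this sketch does not do.

```python
import numbers

import numpy as np


def check_object_entries(X):
    """Hypothetical check: in an object array, allow only numbers and strings."""
    X = np.asarray(X)
    if X.dtype != object:
        # Non-object arrays cannot hold arbitrary Python objects; nothing to check.
        return X
    for index, value in np.ndenumerate(X):
        if not isinstance(value, (numbers.Number, str)):
            raise TypeError(
                f"Entry at index {index} has unsupported type "
                f"{type(value).__name__}; expected a number or a string."
            )
    return X


# Passes: only strings and numbers.
check_object_entries(np.array([["a"], ["b"], [1.5]], dtype=object))

# Rejected: the first entry is a dict.
bad = np.array([{"foo": "bar"}, "a", "b", "c"], dtype=object).reshape(-1, 1)
try:
    check_object_entries(bad)
except TypeError as exc:
    print(exc)
```

This loops over every element in Python, so it would be slow on large arrays; whether that cost is acceptable inside check_array is part of the question.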
