BUG: always raise on NaN in OneHotEncoder for object dtype data#12033
BUG: always raise on NaN in OneHotEncoder for object dtype data#12033rth merged 4 commits intoscikit-learn:masterfrom
Conversation
sklearn/preprocessing/_encoders.py
Outdated
| X = X_temp | ||
|
|
||
| if X.dtype == np.dtype('object'): | ||
| if _object_dtype_isnan(X).any(): |
There was a problem hiding this comment.
Technically this shouldn't happen if the assume_all_finite configuration is set
There was a problem hiding this comment.
Added a check for that.
Is there a way to test this? I could pass data with NaNs with the config set to assume all finite, but then the implementation fails, which seems a bit strange to test?
| def test_one_hot_encoder_raise_missing(X, handle_unknown): | ||
| ohe = OneHotEncoder(categories='auto', handle_unknown=handle_unknown) | ||
|
|
||
| with pytest.raises(ValueError): |
There was a problem hiding this comment.
Maybe add match="Input contains NaN" here and below to be more explicit about the error we are expecting
| ohe = OneHotEncoder(categories='auto', handle_unknown=handle_unknown) | ||
|
|
||
| with pytest.raises(ValueError): | ||
| ohe.fit(X) |
There was a problem hiding this comment.
I though (reading through your issue) this error would be raised only for X.dtype == 'object' not e.g. 'float'?
There was a problem hiding this comment.
for float it already raises correctly on master (on a first check_array when calling fit), it was only not explicitly tested. So the bug was only for object dtype.
jnothman
left a comment
There was a problem hiding this comment.
Is this not something check_array should be doing?
Yes, I was also thinking that, but forgot to add a comment asking about it. Basically, So I can certainly move it there, I am only not fully sure if it would have wider impact (are there other estimators that accept object dtype data apart from imputers and encoders?), and we need to agree on an API to specify this in |
jnothman
left a comment
There was a problem hiding this comment.
Okay with this for now, but we should probably put this into check_array soon
Closes #12018
This does not really solve the underlying issue that the
valid_masklogic (for handling unknown categories) needs to become NaN-aware, but this PR (by checking for NaNs and then raising) will prevent that we get to that point if there are NaNs in the data.When we change the default of raising to passing through the NaNs, the masking logic will need to be reworked anyway.