ENH Adds missing value support to OneHotEncoder#17317
ENH Adds missing value support to OneHotEncoder#17317lorentzenchr merged 57 commits intoscikit-learn:masterfrom
Conversation
amueller
left a comment
There was a problem hiding this comment.
Looks good but needs docs and a whatsnew. Also, you should explain why using None and NaN in a single column is prohibited.
| missing_drops = [(i, val) for i, val in enumerate(self.drop) | ||
| if val not in self.categories_[i]] | ||
|
|
||
| missing_drops = [] |
There was a problem hiding this comment.
Doesn't the missing_drops implementation from two lines above still work? And if not you should at least remove it? but set operations should work on NaN, right?
There was a problem hiding this comment.
this hasn't been addressed, right? There is still an implementation two lines above that is then overwritten. And I am pretty sure that the old code still works. you can also use list.index which works on NaN as expected.
There was a problem hiding this comment.
Since self.categories_[i] is a numpy array, I was trying to avoid converting it into another data structure and making a copy.
| if none_in_diff and nan_in_diff: | ||
| raise ValueError("Input wiith both types of missing, None and " | ||
| "np.nan, is not supported") | ||
| if none_in_diff: |
There was a problem hiding this comment.
Why is this necessary? Does sort fail to sort? Can you add a comment?
There was a problem hiding this comment.
This comment may have lost its context. The nans were removed from the sets above because they do not work with set differencing in this context.
As for the Nones, we could have left the Nones in the set, but it would complicate the _extract_missing helper.
|
Updated PR with:
|
|
What's the motivation for treating None as missing? |
|
Test failures, @thomasjpfan |
|
Test failures are fixed! |
amueller
left a comment
There was a problem hiding this comment.
Looks good. I am still slightly hopeful two places I comment on can be simplified, otherwise ready to merge I'd say :)
| missing_drops = [(i, val) for i, val in enumerate(self.drop) | ||
| if val not in self.categories_[i]] | ||
|
|
||
| missing_drops = [] |
There was a problem hiding this comment.
this hasn't been addressed, right? There is still an implementation two lines above that is then overwritten. And I am pretty sure that the old code still works. you can also use list.index which works on NaN as expected.
| try: | ||
| uniques = sorted(set(values)) | ||
| uniques_set = set(values) | ||
| missing_values_in_set = [value for value in (None, np.nan) |
There was a problem hiding this comment.
It is not:
sorted(['a', 'b', None])
# TypeErrorThere was a problem hiding this comment.
This won't match float('nan'), nor (np.array(0)/0).item(). Does it matter?
To be careful, I think you might be safer with:
missing_values_in_set = [value for value in uniques_set
if value is None or is_scalar_nan(value)]| def test_ohe_missing_values_both_missing_values(): | ||
| # test both types of missing of missing values are treated as its own | ||
| # category | ||
| X = np.array([['a', 'b', None, 'a', np.nan]], dtype=object).T |
There was a problem hiding this comment.
not sure if it's necessary but having np.nan twice would check the deduplication logic (again, I know it's covered in other tests as well).
jnothman
left a comment
There was a problem hiding this comment.
I think I'm done here... Lgtm!
Let's solicit another review...
|
We have ?4 weeks for another review to push this into 0.24 ... volunteers? |
agramfort
left a comment
There was a problem hiding this comment.
besides LGTM
maybe I would not have done with a new MissingValues class but I understand the argument to enforce typing.
| >>> enc.transform(X).toarray() | ||
| array([[0., 1., 0., 0., 1., 0.], | ||
| [1., 0., 0., 0., 0., 1.], | ||
| [0., 0., 1., 1., 0., 0.]]) |
There was a problem hiding this comment.
can you add a note here about what happens if None and np.nan are present in the same column?
There was a problem hiding this comment.
I would also pass explicitly here the handle_unknown parameter. To document it.
There was a problem hiding this comment.
Updated PR:
- The user guide now describes what happens when
nanandNoneare present. - The default value of
handle_unknown='error'is passed explicitly.
|
Two approvals here and the opportunity to close three issues and moving forward with the milestone! |
OneHotEncoder supports categorical features with missing values by considering the missing values as an additional category.
OneHotEncoder supports categorical features with missing values by considering the missing values as an additional category.
|
OMG THIS WAS MERGED!!! I'm so happy! |
|
Yes, thanks @thomasjpfan for putting in the hard work to make this finally happen!! |
Reference Issues/PRs
Fixes #11996
Closes #15009
Closes #13028
Towards #15796
What does this implement/fix? Explain your changes.
Adds missing value support to
OneHotEncoder.For numerical data,
np.nanis represents missing values. For object dtypes,Noneandnp.nanis support for missing values.