ENH Adds missing value support to OneHotEncoder by thomasjpfan · Pull Request #17317 · scikit-learn/scikit-learn

thomasjpfan · 2020-05-23T20:43:31Z

Reference Issues/PRs

Fixes #11996
Closes #15009
Closes #13028
Towards #15796

What does this implement/fix? Explain your changes.

Adds missing value support to OneHotEncoder.

For numerical data, np.nan represents missing values. For object dtypes, both None and np.nan are supported as missing values.

amueller

Looks good but needs docs and a whatsnew. Also, you should explain why using None and NaN in a single column is prohibited.

amueller · 2020-06-15T21:37:20Z

            missing_drops = [(i, val) for i, val in enumerate(self.drop)
                             if val not in self.categories_[i]]
+
+            missing_drops = []


Doesn't the missing_drops implementation from two lines above still work? If not, you should at least remove it. But set operations should work on NaN, right?

this hasn't been addressed, right? There is still an implementation two lines above that is then overwritten. And I am pretty sure that the old code still works. you can also use list.index which works on NaN as expected.

Since self.categories_[i] is a numpy array, I was trying to avoid converting it into another data structure and making a copy.
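The NaN lookup behaviour debated in this thread can be checked in plain Python. This is only a minimal sketch: `float("nan")` stands in for np.nan, the `categories` list stands in for one entry of `categories_`, and `contains` is a hypothetical helper, not code from this PR.

```python
import math

nan = float("nan")            # stand-in for np.nan, which is also a Python float
categories = ["a", "b", nan]  # stand-in for one entry of categories_

# list membership and list.index test identity before equality,
# so the *same* NaN object is found ...
same_object_found = nan in categories
same_object_index = categories.index(nan)

# ... but a different NaN object is not, because nan != nan.
other_nan = float("nan")
other_object_found = other_nan in categories


def contains(categories, value):
    """Membership test that treats every NaN as the same category (hypothetical)."""
    if isinstance(value, float) and math.isnan(value):
        return any(isinstance(c, float) and math.isnan(c) for c in categories)
    return value in categories
```

Whether a plain `in` check is enough therefore depends on whether the NaN being looked up is the same object as the stored one, which is one reason an explicit NaN check can be the safer choice.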

amueller · 2020-06-15T21:45:27Z

+        if none_in_diff and nan_in_diff:
+            raise ValueError("Input wiith both types of missing, None and "
+                             "np.nan, is not supported")
+        if none_in_diff:


Why is this necessary? Does sort fail to sort? Can you add a comment?

This comment may have lost its context. The nans were removed from the sets above because they do not work with set differencing in this context.

As for the Nones, we could have left the Nones in the set, but it would complicate the _extract_missing helper.
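To illustrate why NaN can get in the way of set differencing, here is a small sketch in plain Python. The helper names `is_nan` and `diff_without_missing` are made up for this example, and the approach (strip missing values first, then diff) is an assumption about the general idea, not the PR's actual implementation.

```python
import math


def is_nan(value):
    """True only for float NaN; strings and None are never NaN."""
    return isinstance(value, float) and math.isnan(value)


def diff_without_missing(values, known_values):
    """Set difference computed after removing None and NaN from both sides."""
    values_clean = {v for v in values if v is not None and not is_nan(v)}
    known_clean = {v for v in known_values if v is not None and not is_nan(v)}
    return values_clean - known_clean


nan_a, nan_b = float("nan"), float("nan")

# Two distinct NaN objects never compare equal, so the NaN on the left
# survives the difference even though the right side also contains a NaN.
naive = {"a", nan_a} - {"a", nan_b}

# Stripping missing values first leaves only the genuinely unknown value.
careful = diff_without_missing(["a", "c", nan_a, None], ["a", nan_b])
```

In this sketch the missing values would then be handled in a separate step, which is roughly the division of labour the comment above describes.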

thomasjpfan · 2020-06-24T04:14:46Z

Updated PR with:

Support for both np.nan and None for object dtype.
Updated examples that no longer need an imputer before the OneHotEncoder.

NicolasHug · 2020-06-24T10:31:55Z

What's the motivation for treating None as missing?

jnothman · 2020-06-27T11:36:51Z

Test failures, @thomasjpfan

thomasjpfan · 2020-06-27T14:01:08Z

Test failures are fixed!

amueller

Looks good. I am still slightly hopeful that the two places I commented on can be simplified; otherwise ready to merge, I'd say :)

amueller · 2020-07-02T22:07:42Z

            missing_drops = [(i, val) for i, val in enumerate(self.drop)
                             if val not in self.categories_[i]]
+
+            missing_drops = []


this hasn't been addressed, right? There is still an implementation two lines above that is then overwritten. And I am pretty sure that the old code still works. you can also use list.index which works on NaN as expected.

amueller · 2020-07-02T22:24:09Z

    try:
-        uniques = sorted(set(values))
+        uniques_set = set(values)
+        missing_values_in_set = [value for value in (None, np.nan)


is None not sortable?

It is not:

sorted(['a', 'b', None]) # TypeError

This won't match float('nan'), nor (np.array(0)/0).item(). Does it matter?

To be careful, I think you might be safer with:

missing_values_in_set = [value for value in uniques_set if value is None or is_scalar_nan(value)]
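The suggestion above can be sketched without NumPy. This is a hedged illustration only: `_is_nan` is a local stand-in for a scalar NaN check, `sorted_uniques_with_missing` is a hypothetical name, and placing None before NaN at the end of the result is an assumption made for the example.

```python
import math


def _is_nan(value):
    """Local stand-in for a scalar NaN check; matches any float NaN object."""
    return isinstance(value, float) and math.isnan(value)


def sorted_uniques_with_missing(values):
    """Sort the regular values, then append None and a single NaN if present."""
    uniques = set(values)
    has_none = None in uniques
    has_nan = any(_is_nan(v) for v in uniques)
    # None and NaN are removed before sorting because None is not orderable
    # against strings and NaN does not compare equal to itself.
    result = sorted(v for v in uniques if v is not None and not _is_nan(v))
    if has_none:
        result.append(None)
    if has_nan:
        result.append(float("nan"))  # several NaN objects collapse into one
    return result


# Two distinct NaN objects plus a None in the same column.
result = sorted_uniques_with_missing(
    ["b", None, "a", float("nan"), float("nan"), "a"]
)
```

Checking each element with a NaN predicate, rather than testing membership of the np.nan object itself, is what lets `float('nan')` and other NaN objects be recognised and deduplicated.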

amueller · 2020-07-02T22:43:18Z

+def test_ohe_missing_values_both_missing_values():
+    # test both types of missing of missing values are treated as its own
+    # category
+    X = np.array([['a', 'b', None, 'a', np.nan]], dtype=object).T


not sure if it's necessary but having np.nan twice would check the deduplication logic (again, I know it's covered in other tests as well).

jnothman

I think I'm done here... Lgtm!

Let's solicit another review...

jnothman · 2020-09-24T23:05:44Z

We have ?4 weeks for another review to push this into 0.24 ... volunteers?

agramfort

besides LGTM

Maybe I would not have done it with a new MissingValues class, but I understand the argument to enforce typing.

agramfort · 2020-09-28T12:44:53Z

+    >>> enc.transform(X).toarray()
+    array([[0., 1., 0., 0., 1., 0.],
+           [1., 0., 0., 0., 0., 1.],
+           [0., 0., 1., 1., 0., 0.]])


Can you add a note here about what happens if None and np.nan are present in the same column?

I would also pass the handle_unknown parameter explicitly here, to document it.

Updated PR:

The user guide now describes what happens when nan and None are present.

The default value of handle_unknown='error' is passed explicitly.

agramfort

LGTM thx @thomasjpfan !

cmarmo · 2020-10-08T10:16:17Z

Two approvals here, and the opportunity to close three issues and move forward with the milestone!
@thomasjpfan do you mind syncing (just to be sure)? Then this could probably be merged.

OneHotEncoder supports categorical features with missing values by considering the missing values as an additional category.

amueller · 2020-10-28T19:35:34Z

OMG THIS WAS MERGED!!! I'm so happy!

jnothman · 2020-11-01T12:43:14Z

Yes, thanks @thomasjpfan for putting in the hard work to make this finally happen!!

thomasjpfan added 3 commits May 23, 2020 15:34

ENH Adds support for missing values

56e1487

BUG Fix

9317a82

DOC More docstrings

d6dbcb2

github-actions bot added module:preprocessing module:utils labels May 23, 2020

DOC Adds comments

36ae6ac

amueller reviewed Jun 15, 2020

amueller mentioned this pull request Jun 17, 2020

inconsistent treatment of None and np.NaN in SimpleImputer #17625

Open

thomasjpfan added 5 commits June 20, 2020 18:01

Merge remote-tracking branch 'upstream/master' into one_hot_encoding_…

0d351d7

…missing_category

DOC Adds to user guide

1ec3d59

ENH Remove imputer from example

1d73eb6

ENH Support for both nan and None

9f747ac

CLN Remove unneeded comment

9f51464