[MRG] Add get_feature_names to OneHotEncoder #10198
amueller merged 10 commits into scikit-learn:master
Conversation
I'm getting an error at line 164. Is this a bug or is it just my code?
Apparently there is a tab character in your additions. There should not be. Please remove it!
jnothman left a comment
This requires a test in sklearn/preprocessing/tests/test_data.py
    feature_names = np.concatenate(feature_names)
    return feature_names
Please do not leave a blank line at the end of the file.
sklearn/preprocessing/data.py
Outdated
    return X_tr
    def get_feature_names(self, input_features=None):
this should be defined as a method within the class, not as an external function
sklearn/preprocessing/data.py
Outdated
    temp.append(str(t))
    cats2.append(np.array(temp))
This line has spaces in it, which should be removed. It also has a tab character.
    feature_names2 = enc.get_feature_names(['one', 'two',
                                            'three', 'four', 'five'])
    assert_array_equal(['one_Female', 'one_Male',
test failures related to pep8 and whitespace around here.
amueller left a comment
Works but I think the implementation is a bit needlessly complicated.
sklearn/preprocessing/data.py
Outdated
    return X_tr
    def get_feature_names(self, input_features=None):
        """
sklearn/preprocessing/data.py
Outdated
    Returns
    -------
    output_feature_names : list of string, length n_output_features
Maybe a very short doctest example would be illustrative?
I have added it under the class declaration, please check.
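The kind of short doctest example suggested above could look like the following sketch. The helper below only mimics the PR's "feature_name_category" naming scheme; it is not the actual scikit-learn method, and `categories` stands in for the encoder's fitted `categories_` attribute.

```python
# Illustrative stand-in for the PR's method, using its default 'x%d' names
# when no input_features are passed. Not the real scikit-learn code.
def get_feature_names(categories, input_features=None):
    if input_features is None:
        input_features = ['x%d' % i for i in range(len(categories))]
    return [input_features[i] + '_' + str(cat)
            for i, cats in enumerate(categories)
            for cat in cats]

names = get_feature_names([['Female', 'Male'], [1, 2, 3]])
# -> ['x0_Female', 'x0_Male', 'x1_1', 'x1_2', 'x1_3']
```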
sklearn/preprocessing/data.py
Outdated
    input_features = ['x%d' % i for i in range(len(cats))]
    cats2 = []
    for i in range(len(cats)):
I don't understand why you iterate over cats twice
sklearn/preprocessing/data.py
Outdated
    t = [input_features[i] + '_' + f for f in cats2[i]]
    feature_names.append(t)
    feature_names = np.concatenate(feature_names).tolist()
That's a weird way to join lists: you convert them to numpy arrays, concatenate them, and convert them back to lists.
You could probably just do a single list comprehension for everything.
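The single list comprehension suggested here could look like this sketch, with `input_features` and `cats` as example values standing in for the PR's variables:

```python
# Example inputs: one name per original feature, and the fitted category
# values for each feature (stand-ins for the PR's variables).
input_features = ['one', 'two']
cats = [['Female', 'Male'], [1, 2, 3]]

# One flat comprehension replaces the list -> np.concatenate -> tolist()
# round-trip.
feature_names = [input_features[i] + '_' + str(t)
                 for i in range(len(cats))
                 for t in cats[i]]
# -> ['one_Female', 'one_Male', 'two_1', 'two_2', 'two_3']
```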
jnothman left a comment
This strategy is only applicable if encoding='onehot' or encoding='onehot-dense'
sklearn/preprocessing/data.py
Outdated
| """ | ||
| cats = self.categories_ | ||
| feature_names = [] | ||
| if input_features is None: |
If not, please validate that the length of input_features corresponds to self.categories_.
sklearn/preprocessing/data.py
Outdated
    for t in cats[i]:
        temp.append(str(t))
    t = [input_features[i] + '_' + f for f in temp]
    feature_names.append(t)
Why not just

    feature_names.extend(input_features[i] + '_' + str(t)
                         for t in cats[i])

?
I also think we should be using unicode here in python 2.
I've added the two changes you mentioned and will make a PR soon. I also tried with encoding='ordinal' and it worked with that too.
I don't understand what you mean by using unicode here. Can you explain that better?
No need to make a new PR, just push more commits to this branch, please.
In Python 2, we need to use unicode rather than str to allow for non-ASCII strings.
@jnothman I am not sure this conversion to unicode is needed.
If a user has unicode categories or passes unicode input features, concatenating them still works as expected in python 2 (a mixture of string and unicode will just automatically be unicode as result).
Not doing this conversion AFAIK only means we can end up with a list of both string and unicode values in python 2.
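The coercion behaviour described here can be sanity-checked with a tiny example. This is a sketch with made-up values (`prefix`, `category`); under Python 2 a str/unicode mix coerces to unicode, and under Python 3 every str is already unicode.

```python
# Concatenating a plain feature name with a non-ascii category value: in
# Python 2 the str/unicode mix yields unicode; in Python 3 all strings are
# unicode already, so the same expression works unchanged.
prefix = 'x0'            # plain string feature name
category = u'caf\xe9'    # non-ascii category value ("café")
name = prefix + u'_' + category
```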
As long as we test for reasonable behaviour, I'm fine with it!
OK, will do some more testing on this (I actually expect the current tests to fail on python 2, let's see)
I've added a check for input_features and also unicode support for Python 2. Please review.
jorisvandenbossche left a comment
I think this is looking good, added some minor comments.
@Nirvan101 do you have time to update this in the short term?
It would be nice to include this in the release.
sklearn/preprocessing/data.py
Outdated
| " length equal to number of features") | ||
|
|
||
    def to_unicode(text):
        return text if isinstance(text, unicode) else text.encode('utf8')
I think you can use six.text_type here instead of the unicode (and redefining unicode depending on the version)
    'three_boy', 'three_girl',
    'four_1', 'four_2', 'four_12', 'four_21',
    'five_3', 'five_10', 'five_30'], feature_names2)
Can you also add a test catching the error if the passed input_features are not of the correct length or are not strings?
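The requested test could look like the following sketch. `check_input_features` is a hypothetical stand-in for the validation inside the PR's method, not actual scikit-learn code.

```python
# Stand-in validator mirroring the length check discussed in this PR.
def check_input_features(input_features, categories):
    if len(input_features) != len(categories):
        raise ValueError(
            "input_features should have length equal to number of "
            "features (%d), got %d"
            % (len(categories), len(input_features)))

# A wrong-length input_features should raise, with both lengths in the
# error message.
try:
    check_input_features(['one'], [['F', 'M'], [1, 2]])
    raised = False
except ValueError as exc:
    raised = True
    message = str(exc)
```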
@Nirvan101 do you have time to update this in the short term?
@jorisvandenbossche Sorry for the late reply. I'm afraid I don't have the time to work on this currently. Please feel free to take it up.
Thanks for letting us know!
This pull request introduces 1 alert when merging 4e17c86 into 3b5abf7 - view the new alerts on LGTM.com.
Comment posted by LGTM.com
@jnothman AFAIK the added [...]. If that is the case, I think this PR is pretty good to go, and it would be nice to include it in 0.20.0, as it is really useful for the OneHotEncoder.
Ah, it seems that [...]
@jorisvandenbossche yes, that's the only other place where that was done. We don't really have an API for making this work with pipelines or using pandas column names, but I think the optional [...]
@jorisvandenbossche so you're taking this over? It would be great to have!
Yes, I can finish this. But I think it is as good as ready (so a new review round is certainly welcome). I just want to check if we have enough tests for Python 2 / unicode.
sklearn/preprocessing/_encoders.py
Outdated
    for i in range(len(cats)):
        feature_names.extend(to_unicode(
            input_features[i] + '_' + str(t)) for t in cats[i])
Is underscore a good choice? Should we allow users to specify this?
I think it is a reasonable default (at least better than . or = IMO). Do you have other ideas? (double underscore, ..)
I think the option to specify it could be left for another PR (as long as we now choose what we would otherwise have as the default for that option).
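The deferred separator option could be sketched as below. Both the helper name and the `sep` parameter are hypothetical (they are not in this PR), with `'_'` kept as the default chosen here.

```python
# Hypothetical helper showing how a configurable separator could work,
# defaulting to the single underscore this PR settles on.
def make_feature_names(input_features, categories, sep='_'):
    return [f + sep + str(c)
            for f, cats in zip(input_features, categories)
            for c in cats]

make_feature_names(['sex'], [['F', 'M']], sep='__')
# -> ['sex__F', 'sex__M']
```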
sklearn/preprocessing/_encoders.py
Outdated
    Returns
    -------
    output_feature_names : list of string, length n_output_features
Do we want a list of strings? I think numpy arrays might be better. I hate the fact that CountVectorizer returns lists because I always need to convert before I can slice.
So it's a question of convenience vs consistency, but I might rather want to change that in CountVectorizer for 1.0 ;)
I am fine with making this numpy arrays (of object dtype then). Since we are adding it here to a new class, this seems like a good time to do it, if we consider changing it later for CountVectorizer.
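The practical difference being discussed can be shown in a short sketch (illustrative values only): an object-dtype numpy array supports the slicing that a plain list does not.

```python
import numpy as np

# An object-dtype array of feature names supports fancy and boolean
# indexing directly, which is the convenience argument made above.
feature_names = np.array(['x0_Female', 'x0_Male', 'x1_1'], dtype=object)
subset = feature_names[[0, 2]]   # integer fancy indexing; lists can't do this
mask = feature_names[np.array([n.startswith('x0') for n in feature_names])]
```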
sklearn/preprocessing/_encoders.py
Outdated
        input_features = ['x%d' % i for i in range(len(cats))]
    elif(len(input_features) != len(self.categories_)):
        raise ValueError("input_features should have"
                         " length equal to number of features")
error message should probably contain both values, right?
"input_features should have length equal to number of features {}, got {}".format(len(self.categories_), len(input_features))?
amueller left a comment
Haven't checked the unicode stuff but otherwise looks good.
Updated, and also changed it to return an array.
LGTM. You could test the numbers in the error message, but that seems like overkill, I guess.
    feature_names = []
    for i in range(len(cats)):
        names = [
            input_features[i] + '_' + six.text_type(t) for t in cats[i]]
Would it be appropriate to use __?
I'm fine with using single underscore. Merge?
Andy asked a similar question in #10198 (comment), wondering if there could be a parameter to override the default separator. I think the best options are [...]
Fixed the merge conflict.
Reference Issues/PRs
Fixes #10181
What does this implement/fix? Explain your changes.
Added the function get_feature_names() to the CategoricalEncoder class. This is in data.py under sklearn.preprocessing.