[MRG+2] add get_feature_names to PolynomialFeatures#6372
[MRG+2] add get_feature_names to PolynomialFeatures#6372jakevdp merged 4 commits intoscikit-learn:masterfrom
Conversation
|
ping @jakevdp ;) |
48eb9e7 to
30e591e
Compare
| return np.vstack(np.bincount(c, minlength=self.n_input_features_) | ||
| for c in combinations) | ||
|
|
||
| def get_feature_names(self, input_features=None): |
There was a problem hiding this comment.
I've been suggesting for a while that this is an appropriate way to deal with feature names in extractor/transformer pipelines. I'm happy to see this.
There was a problem hiding this comment.
I've been wanting to do it but didn't have time. I'm just writing the book chapter about preprocessing and I'm embarrassed not having that feature ;)
There was a problem hiding this comment.
Hey, who knew writing books would be good for something?
|
LGTM. |
|
I qualify that LGTM. I think we need to decide whether A further small issue: should these be |
sklearn/preprocessing/data.py
Outdated
| ---------- | ||
| input_features : list of string, length n_features, optional | ||
| String names for input features if available. By default, | ||
| "x0", "x1", ... "x_n_features" is used. |
There was a problem hiding this comment.
Should have underscores (x_0, x_1) to match code below.
There was a problem hiding this comment.
Maybe it would be cleaner without underscores? I find things like x_1^2 x_2^1 harder to visually parse than x1^2 x2^1
There was a problem hiding this comment.
yeah... well I was thinking about latex. But maybe that's silly.
30e591e to
ec4d92d
Compare
|
I changed the naming scheme to |
ec4d92d to
bff115c
Compare
|
added a unicode test. Is that what you had in mind @jnothman ? |
| # test some unicode | ||
| poly = PolynomialFeatures(degree=1, include_bias=True).fit(X) | ||
| feature_names = poly.get_feature_names([u"\u0001F40D", u"\u262E", u"\u05D0"]) | ||
| assert_array_equal(["1", u"\u0001F40D^1", u"\u262E^1", u"\u05D0^1"], |
There was a problem hiding this comment.
It's a bit awkward that 1 is not unicode. Can we find a simple convention for such things?
That consistency (I assume you mean with visualisation) might be worthwhile, though I know that the use of the first subscript (as opposed to |
|
hm, I can't reproduce the error... it seems to be in |
|
|
bff115c to
ddc1740
Compare
|
should be fixed. I have no opinion re |
|
I would prefer |
9a11f46 to
897f8b6
Compare
|
+1 for merge |
|
+1 |
|
|
||
| Returns | ||
| ------- | ||
| output_feature_names : list of string, length n_output_features |
There was a problem hiding this comment.
Just a nitpick here:
Maybe there should be a "." in the end of this line?
There was a problem hiding this comment.
Oh I see.
So this line:
https://github.com/amueller/scikit-learn/blob/poly_feature_names/sklearn/preprocessing/data.py#L1586
ends with a "." because it's a new sentence used to describe output?
There was a problem hiding this comment.
Yes, it is description, not type.
|
should be good now. |
|
Once tests pass I'd say we can merge |
|
Thanks @amueller! |
[MRG+2] add get_feature_names to PolynomialFeatures
Fixes #6185, replaces #6216