ENH Adds feature_names_out to preprocessing module #21079
lesteve merged 28 commits into scikit-learn:main
Conversation
@thomasjpfan: Thanks for the work - this will be so useful in practice! Does this also cover the `FunctionTransformer` case, where the number of output columns can differ from the input? For example:
```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def two_columns(X):
    return np.concatenate([X, 2 * X], axis=1)

transformer = FunctionTransformer(two_columns)
X = np.array([[0, 1], [2, 3]])
transformer.transform(X)
# array([[0, 1, 0, 2],
#        [2, 3, 4, 6]])
```

I have some thoughts on what API to use for this, and it will be in a follow-up PR.
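Until such an API lands, a subclass can supply the expanded names itself. A minimal sketch (the `TwoColumnsTransformer` class and its naming scheme are my own illustration, not part of this PR):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def two_columns(X):
    return np.concatenate([X, 2 * X], axis=1)

class TwoColumnsTransformer(FunctionTransformer):
    """Sketch: a FunctionTransformer subclass that names its expanded output."""

    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            input_features = [f"x{i}" for i in range(self.n_features_in_)]
        # Original columns first, then the doubled copies, matching two_columns.
        return np.asarray(
            list(input_features) + [f"2*{name}" for name in input_features],
            dtype=object,
        )

X = np.array([[0, 1], [2, 3]])
# validate=True so that fit records n_features_in_
transformer = TwoColumnsTransformer(two_columns, validate=True).fit(X)
transformer.get_feature_names_out()
# array(['x0', 'x1', '2*x0', '2*x1'], dtype=object)
```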
sklearn/preprocessing/_data.py (outdated)

```python
        return K

    def get_feature_names_out(self, input_features=None):
```
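From the discussion below, the body of that method presumably follows the `{class_name}{i}` pattern. A reconstruction (mine, not the actual diff):

```python
import numpy as np

def get_feature_names_out(self, input_features=None):
    """Name the outputs kernelcenterer0, kernelcenterer1, ...

    For KernelCenterer the number of outputs equals n_features_in_.
    """
    class_name = self.__class__.__name__.lower()
    return np.asarray(
        [f"{class_name}{i}" for i in range(self.n_features_in_)],
        dtype=object,
    )
```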
Isn't the content of this method copy/pasted across every transformer that uses the `[f"{class_name}{i}" for i in range(self.n_features_in_)]` pattern? Shouldn't we move it to a `_ClassNameFeatureNameMixin` kind of thing?
Do we want to do it now, or make a final refactoring once we have all the different ways to generate names out?
I would do it now and iteratively improve it, rather than do a big PR at the end. But also check #21334, which goes in this direction.
For `_ClassNameFeatureNameMixin` to work in general, it needs a way to get the "number of feature names out" from the actual class. `KernelCenterer` is a special case where it turns out that `n_features_in_ == n_features_out_` and the names are prefixed.
In general, in #21334 the `feature_names_out_` are different from the feature names going in.
If we want to work toward a mixin, we can wait and work out a solution in #21334 and apply it here.
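For concreteness, the kind of mixin being discussed might look like the sketch below (the mixin name and the `_n_features_out` attribute are placeholders of mine; #21334 settled on its own naming):

```python
import numpy as np

class _ClassNameFeatureNamesOutMixin:
    """Prefix output feature names with the lowercased class name."""

    def get_feature_names_out(self, input_features=None):
        # _n_features_out must be provided by the concrete transformer;
        # for KernelCenterer it would simply equal n_features_in_.
        class_name = self.__class__.__name__.lower()
        return np.asarray(
            [f"{class_name}{i}" for i in range(self._n_features_out)],
            dtype=object,
        )
```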
#21334 is now merged so this PR can be updated accordingly.
Hi @thomasjpfan, I'd love to know your plan for this. For example, suppose you have a DataFrame with features A, B, C, D, and you'd like to create a simple pipeline that runs a `SimpleImputer` and then computes the ratio of two features:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

def compute_ratio(X):
    X = getattr(X, "values", X)
    return X[:, [0]] / X[:, [1]]

def feature_ratio_transformer(ratio_name):
    return make_pipeline(SimpleImputer(),
                         FunctionTransformer(compute_ratio,
                                             feature_names_out=[ratio_name]))

preprocessing = make_column_transformer(
    ("passthrough", ["A", "B", "C", "D"]),
    (feature_ratio_transformer("A/B ratio"), ["A", "B"]),
    (feature_ratio_transformer("C/D ratio"), ["C", "D"]),
)
output = preprocessing.fit_transform(pd.DataFrame({
    "A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9], "D": [10, 11, 12]}))
```

```python
>>> preprocessing.get_feature_names_out()
array(['passthrough__A', 'passthrough__B', 'passthrough__C',
       'passthrough__D', 'pipeline-1__A/B ratio', 'pipeline-2__C/D ratio'],
      dtype=object)
```

It feels simpler to just write a custom transformer that does everything, but that would go against the core principle of Scikit-Learn of keeping things composable. Perhaps in this example it would be simpler if the `ColumnTransformer` let you pass the output feature names directly:

```python
preprocessing = make_column_transformer(
    ("passthrough", ["A", "B", "C", "D"]),
    (feature_ratio_transformer(), ["A", "B"], ["A/B ratio"]),
    (feature_ratio_transformer(), ["C", "D"], ["C/D ratio"]),
)
```

Wdyt?
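As a stopgap under the current 1.0 API, one could approximate that third-element idea with a small wrapper. A sketch (the `NamedOutput` class is my own illustration, not an sklearn API):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class NamedOutput(TransformerMixin, BaseEstimator):
    """Delegate to a transformer but report fixed output feature names."""

    def __init__(self, transformer, names):
        self.transformer = transformer
        self.names = names

    def fit(self, X, y=None):
        self.transformer.fit(X, y)
        return self

    def transform(self, X):
        return self.transformer.transform(X)

    def get_feature_names_out(self, input_features=None):
        return np.asarray(self.names, dtype=object)

# e.g. in the column transformer above, instead of a third tuple element:
# (NamedOutput(make_pipeline(SimpleImputer(),
#                            FunctionTransformer(compute_ratio)),
#              ["A/B ratio"]), ["A", "B"])
```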
Adding a `feature_names_out` parameter to `FunctionTransformer` would make this work without the intermediate pipelines:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import FunctionTransformer

def ratio_transformer(ratio_name):
    def compute_ratio(X):
        X = getattr(X, "values", X)
        return X[:, [0]] / X[:, [1]]
    return FunctionTransformer(compute_ratio,
                               feature_names_out=[ratio_name])

preprocessing = make_column_transformer(
    ("passthrough", ["A", "B", "C", "D"]),
    (ratio_transformer("A/B ratio"), ["A", "B"]),
    (ratio_transformer("C/D ratio"), ["C", "D"]),
)
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9],
                   "D": [10, 11, 12]})
output = preprocessing.fit_transform(df)
```

```python
>>> preprocessing.get_feature_names_out()
array(['passthrough__A', 'passthrough__B', 'passthrough__C',
       'passthrough__D', 'functiontransformer-1__A/B ratio',
       'functiontransformer-2__C/D ratio'], dtype=object)
```

FYI, I currently use the following monkey-patching function to add the missing `get_feature_names_out()` support:

```python
def monkey_patch_get_signature_names_out():
    """Monkey patch some classes which did not handle get_feature_names_out()
    correctly in 1.0.0."""
    from inspect import Signature, signature, Parameter
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline, Pipeline
    from sklearn.preprocessing import FunctionTransformer, StandardScaler

    default_get_feature_names_out = StandardScaler.get_feature_names_out

    if not hasattr(SimpleImputer, "get_feature_names_out"):
        print("Monkey-patching SimpleImputer.get_feature_names_out()")
        SimpleImputer.get_feature_names_out = default_get_feature_names_out

    if not hasattr(FunctionTransformer, "get_feature_names_out"):
        print("Monkey-patching FunctionTransformer.get_feature_names_out()")
        orig_init = FunctionTransformer.__init__
        orig_sig = signature(orig_init)

        # Accept an extra keyword-only feature_names_out argument and store it.
        def __init__(*args, feature_names_out=None, **kwargs):
            orig_sig.bind(*args, **kwargs)
            orig_init(*args, **kwargs)
            args[0].feature_names_out = feature_names_out

        __init__.__signature__ = Signature(
            list(signature(orig_init).parameters.values()) + [
                Parameter("feature_names_out", Parameter.KEYWORD_ONLY)])

        def get_feature_names_out(self, names=None):
            if self.feature_names_out is None:
                return default_get_feature_names_out(self, names)
            elif callable(self.feature_names_out):
                return self.feature_names_out(names)
            else:
                return self.feature_names_out

        FunctionTransformer.__init__ = __init__
        FunctionTransformer.get_feature_names_out = get_feature_names_out

    # Detect whether Pipeline.get_feature_names_out ignores the inner steps.
    p = make_pipeline(SimpleImputer(), SimpleImputer())
    p.fit_transform(pd.DataFrame({"A": [1., 2.], "B": [3., 4.]}))
    if list(p.get_feature_names_out()) == ["x0", "x1"]:
        print("Monkey-patching Pipeline.get_feature_names_out()")

        def get_feature_names_out(self, names=None):
            names = default_get_feature_names_out(self, names)
            for transformer in self:
                names = transformer.get_feature_names_out(names)
            return names

        Pipeline.get_feature_names_out = get_feature_names_out

monkey_patch_get_signature_names_out()
```
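After running the patch, a quick check along these lines should work (a sketch; the output is what I'd expect from the code above):

```python
from sklearn.preprocessing import FunctionTransformer

# feature_names_out is a plain list here, so the patched method returns it
# as-is, without needing to fit first.
ft = FunctionTransformer(lambda X: X, feature_names_out=["ratio"])
print(ft.get_feature_names_out())  # ['ratio']
```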
@ageron would you be interested in opening a PR for the case of `FunctionTransformer`?
I'd be happier with using the mixin, so that we have a more coherent solution across the codebase.
@ogrisel, sure, I'll give it a shot.
@adrinjalali I made the requested change in 2cd55e9. |
adrinjalali left a comment:
LGTM, happy for this to be merged once the conflict is resolved.
Merging this one since CI is green and there were already two approvals.
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: 赵丰 (Zhao Feng) <[email protected]>
Co-authored-by: Niket Jain <[email protected]>
Co-authored-by: Loïc Estève <[email protected]>
Reference Issues/PRs
Continues #18444
What does this implement/fix? Explain your changes.
This PR adds `get_feature_names_out` support to the transformers in the preprocessing module.
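For illustration, the kind of behaviour this enables (a sketch; the `KernelCenterer` names follow the `{class_name}{i}` pattern discussed above, and the exact output may differ):

```python
import numpy as np
from sklearn.preprocessing import KernelCenterer, StandardScaler

X = np.array([[0., 1.], [2., 3.]])
print(StandardScaler().fit(X).get_feature_names_out())
# ['x0' 'x1']  (falls back to x0, x1, ... when no input names are known)

K = X @ X.T  # a kernel matrix for KernelCenterer
print(KernelCenterer().fit(K).get_feature_names_out())
# ['kernelcenterer0' 'kernelcenterer1']
```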
Any other comments?
Feels like `Normalizer`, `OrdinalEncoder`, and `Binarizer` could be in 1.0, but it's most likely too late now.