ENH Adds feature_names_out to preprocessing module #21079
lesteve merged 28 commits into scikit-learn:main
Conversation
@thomasjpfan: Thanks for the work - this will be so useful in practice! Does this also cover the `FunctionTransformer` case, where the number of output columns can differ from the input? For example:
```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def two_columns(X):
    return np.concatenate([X, 2 * X], axis=1)

transformer = FunctionTransformer(two_columns)
X = np.array([[0, 1], [2, 3]])
transformer.transform(X)
# array([[0, 1, 0, 2],
#        [2, 3, 4, 6]])
```

I have some thoughts on what API to use for this, and it will be in a follow-up PR.
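Until such an API lands, a subclass can supply the expanded names itself. A minimal sketch (the `TwoColumnsTransformer` class and its naming scheme are my own illustration, not part of this PR):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def two_columns(X):
    return np.concatenate([X, 2 * X], axis=1)

class TwoColumnsTransformer(FunctionTransformer):
    """Sketch: a FunctionTransformer subclass that names its expanded output."""

    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            input_features = [f"x{i}" for i in range(self.n_features_in_)]
        # Original columns first, then the doubled copies, matching two_columns.
        return np.asarray(
            list(input_features) + [f"2*{name}" for name in input_features],
            dtype=object,
        )

X = np.array([[0, 1], [2, 3]])
# validate=True so that fit records n_features_in_
transformer = TwoColumnsTransformer(two_columns, validate=True).fit(X)
transformer.get_feature_names_out()
# array(['x0', 'x1', '2*x0', '2*x1'], dtype=object)
```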
sklearn/preprocessing/_data.py (outdated)

```python
        return K

    def get_feature_names_out(self, input_features=None):
```
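From the discussion below, the body of that method presumably follows the `{class_name}{i}` pattern. A reconstruction (mine, not the actual diff):

```python
import numpy as np

def get_feature_names_out(self, input_features=None):
    """Name the outputs kernelcenterer0, kernelcenterer1, ...

    For KernelCenterer the number of outputs equals n_features_in_.
    """
    class_name = self.__class__.__name__.lower()
    return np.asarray(
        [f"{class_name}{i}" for i in range(self.n_features_in_)],
        dtype=object,
    )
```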
Isn't the content of this method copy/pasted across every transformer that uses the `[f"{class_name}{i}" for i in range(self.n_features_in_)]` pattern? Shouldn't we move it to a `_ClassNameFeatureNameMixin` kind of thing?
Do we want to do it now, or make a final refactoring once we have all the different ways to generate names out?
I would do it now and iteratively improve it, rather than do a big PR at the end. But also check #21334, which goes in this direction.
For `_ClassNameFeatureNameMixin` to work in general, it needs a way to get the "number of feature names out" from the actual class. `KernelCenterer` is a special case where it turns out that `n_features_in_ == n_features_out_` and the names are prefixed.
In general, in #21334 the `feature_names_out_` are different from the feature names going in.
If we want to work toward a mixin, we can wait and work out a solution in #21334 and apply it here.
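For concreteness, the kind of mixin being discussed might look like the sketch below (the mixin name and the `_n_features_out` attribute are placeholders of mine; #21334 settled on its own naming):

```python
import numpy as np

class _ClassNameFeatureNamesOutMixin:
    """Prefix output feature names with the lowercased class name."""

    def get_feature_names_out(self, input_features=None):
        # _n_features_out must be provided by the concrete transformer;
        # for KernelCenterer it would simply equal n_features_in_.
        class_name = self.__class__.__name__.lower()
        return np.asarray(
            [f"{class_name}{i}" for i in range(self._n_features_out)],
            dtype=object,
        )
```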
#21334 is now merged so this PR can be updated accordingly.
Hi @thomasjpfan, I'd love to know your plan for this. For example, suppose you have a DataFrame with features A, B, C, D, and you'd like to create a simple pipeline that runs a `SimpleImputer` and then computes the ratio of two features:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

def compute_ratio(X):
    X = getattr(X, "values", X)
    return X[:, [0]] / X[:, [1]]

def feature_ratio_transformer(ratio_name):
    return make_pipeline(SimpleImputer(),
                         FunctionTransformer(compute_ratio,
                                             feature_names_out=[ratio_name]))

preprocessing = make_column_transformer(
    ("passthrough", ["A", "B", "C", "D"]),
    (feature_ratio_transformer("A/B ratio"), ["A", "B"]),
    (feature_ratio_transformer("C/D ratio"), ["C", "D"]),
)
output = preprocessing.fit_transform(pd.DataFrame({
    "A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9], "D": [10, 11, 12]}))
```

```python
>>> preprocessing.get_feature_names_out()
array(['passthrough__A', 'passthrough__B', 'passthrough__C',
       'passthrough__D', 'pipeline-1__A/B ratio', 'pipeline-2__C/D ratio'],
      dtype=object)
```

It feels simpler to just write a custom transformer that does everything, but that would go against the core principle of Scikit-Learn of keeping things composable. Perhaps in this example it would be simpler if the `ColumnTransformer` let you pass the output feature names directly:

```python
preprocessing = make_column_transformer(
    ("passthrough", ["A", "B", "C", "D"]),
    (feature_ratio_transformer(), ["A", "B"], ["A/B ratio"]),
    (feature_ratio_transformer(), ["C", "D"], ["C/D ratio"]),
)
```

Wdyt?
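As a stopgap under the current 1.0 API, one could approximate that third-element idea with a small wrapper. A sketch (the `NamedOutput` class is my own illustration, not an sklearn API):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class NamedOutput(TransformerMixin, BaseEstimator):
    """Delegate to a transformer but report fixed output feature names."""

    def __init__(self, transformer, names):
        self.transformer = transformer
        self.names = names

    def fit(self, X, y=None):
        self.transformer.fit(X, y)
        return self

    def transform(self, X):
        return self.transformer.transform(X)

    def get_feature_names_out(self, input_features=None):
        return np.asarray(self.names, dtype=object)

# e.g. in the column transformer above, instead of a third tuple element:
# (NamedOutput(make_pipeline(SimpleImputer(),
#                            FunctionTransformer(compute_ratio)),
#              ["A/B ratio"]), ["A", "B"])
```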
Adding a `feature_names_out` parameter to `FunctionTransformer` would make this work without the intermediate pipelines:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import FunctionTransformer

def ratio_transformer(ratio_name):
    def compute_ratio(X):
        X = getattr(X, "values", X)
        return X[:, [0]] / X[:, [1]]
    return FunctionTransformer(compute_ratio,
                               feature_names_out=[ratio_name])

preprocessing = make_column_transformer(
    ("passthrough", ["A", "B", "C", "D"]),
    (ratio_transformer("A/B ratio"), ["A", "B"]),
    (ratio_transformer("C/D ratio"), ["C", "D"]),
)
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9],
                   "D": [10, 11, 12]})
output = preprocessing.fit_transform(df)
```

```python
>>> preprocessing.get_feature_names_out()
array(['passthrough__A', 'passthrough__B', 'passthrough__C',
       'passthrough__D', 'functiontransformer-1__A/B ratio',
       'functiontransformer-2__C/D ratio'], dtype=object)
```

FYI, I currently use the following monkey-patching function to add the missing `get_feature_names_out()` support:

```python
def monkey_patch_get_signature_names_out():
    """Monkey patch some classes which did not handle get_feature_names_out()
    correctly in 1.0.0."""
    from inspect import Signature, signature, Parameter
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline, Pipeline
    from sklearn.preprocessing import FunctionTransformer, StandardScaler

    default_get_feature_names_out = StandardScaler.get_feature_names_out

    if not hasattr(SimpleImputer, "get_feature_names_out"):
        print("Monkey-patching SimpleImputer.get_feature_names_out()")
        SimpleImputer.get_feature_names_out = default_get_feature_names_out

    if not hasattr(FunctionTransformer, "get_feature_names_out"):
        print("Monkey-patching FunctionTransformer.get_feature_names_out()")
        orig_init = FunctionTransformer.__init__
        orig_sig = signature(orig_init)

        # Accept an extra keyword-only feature_names_out argument and store it.
        def __init__(*args, feature_names_out=None, **kwargs):
            orig_sig.bind(*args, **kwargs)
            orig_init(*args, **kwargs)
            args[0].feature_names_out = feature_names_out

        __init__.__signature__ = Signature(
            list(signature(orig_init).parameters.values()) + [
                Parameter("feature_names_out", Parameter.KEYWORD_ONLY)])

        def get_feature_names_out(self, names=None):
            if self.feature_names_out is None:
                return default_get_feature_names_out(self, names)
            elif callable(self.feature_names_out):
                return self.feature_names_out(names)
            else:
                return self.feature_names_out

        FunctionTransformer.__init__ = __init__
        FunctionTransformer.get_feature_names_out = get_feature_names_out

    # Detect whether Pipeline.get_feature_names_out ignores the inner steps.
    p = make_pipeline(SimpleImputer(), SimpleImputer())
    p.fit_transform(pd.DataFrame({"A": [1., 2.], "B": [3., 4.]}))
    if list(p.get_feature_names_out()) == ["x0", "x1"]:
        print("Monkey-patching Pipeline.get_feature_names_out()")

        def get_feature_names_out(self, names=None):
            names = default_get_feature_names_out(self, names)
            for transformer in self:
                names = transformer.get_feature_names_out(names)
            return names

        Pipeline.get_feature_names_out = get_feature_names_out

monkey_patch_get_signature_names_out()
```
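After running the patch, a quick check along these lines should work (a sketch; the output is what I'd expect from the code above):

```python
from sklearn.preprocessing import FunctionTransformer

# feature_names_out is a plain list here, so the patched method returns it
# as-is, without needing to fit first.
ft = FunctionTransformer(lambda X: X, feature_names_out=["ratio"])
print(ft.get_feature_names_out())  # ['ratio']
```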
@ageron would you be interested in opening a PR for the case of `FunctionTransformer`?
I'd be happier with using the mixin, so that we have a more coherent solution across the codebase.
@ogrisel, sure, I'll give it a shot.
@adrinjalali I made the requested change in 2cd55e9. |
adrinjalali left a comment:
LGTM, happy for this to be merged once the conflict is resolved.
Merging this one since CI is green and there were already two approvals.
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: 赵丰 (Zhao Feng) <[email protected]>
Co-authored-by: Niket Jain <[email protected]>
Co-authored-by: Loïc Estève <[email protected]>
Reference Issues/PRs
Continues #18444
What does this implement/fix? Explain your changes.
This PR adds `get_feature_names_out` support to the transformers in the preprocessing module.
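For illustration, the kind of behaviour this enables (a sketch; the `KernelCenterer` names follow the `{class_name}{i}` pattern discussed above, and the exact output may differ):

```python
import numpy as np
from sklearn.preprocessing import KernelCenterer, StandardScaler

X = np.array([[0., 1.], [2., 3.]])
print(StandardScaler().fit(X).get_feature_names_out())
# ['x0' 'x1']  (falls back to x0, x1, ... when no input names are known)

K = X @ X.T  # a kernel matrix for KernelCenterer
print(KernelCenterer().fit(K).get_feature_names_out())
# ['kernelcenterer0' 'kernelcenterer1']
```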
Any other comments?
Feels like `Normalizer`, `OrdinalEncoder`, and `Binarizer` could be in 1.0, but it's most likely too late now.