ENH Add get_feature_names_out to FunctionTransformer #21569
thomasjpfan merged 17 commits into scikit-learn:main
Conversation
thomasjpfan
left a comment
When validate=False and the feature_names_out parameter is set, I propose we set `feature_names_in_` and `n_features_in_`, but not validate them during fit or transform.
As for the API, I am thinking of restricting `feature_names_out` to two options at first:

- `None`: no feature names out
- callable: a user-provided function to compute the feature names out

Two more options for follow-up PRs:

- `'one-to-one'`: feature names out == feature names in
- array-like of strings: I am currently unsure about a use case for this option that the callable cannot cover, but we can discuss it in a follow-up.
Thanks @thomasjpfan. I'll remove the option to set `feature_names_out` to an array-like of strings.
I think the default still needs to be `None`. Let's add `'one-to-one'` as a non-default option.
…ake default 'one-to-one'
I just read your message; I had already updated the PR to remove the option to pass an array-like of strings, and I set the default to `'one-to-one'`.
It is, but I do not think we can assume it. If a user passes a function that creates a column, then …

We can use scikit-learn/sklearn/utils/metaestimators.py (Line 140 in 48e83df).
Thanks @thomasjpfan. I updated the PR to make `None` the default. Right now `get_feature_names_out` raises a `ValueError` if …
I ran `black`, `flake8`, `make test-coverage`, etc., but they didn't catch the issues with the numpydoc docstring (a newline was missing) or with v1.1.rst (someone else had forgotten a backtick). I looked in the Contributing doc, but I can't find instructions for catching these errors before I push the code to GitHub. Did I miss something?
Hi @thomasjpfan, is there anything else you need me to do for this PR?
For some reason the numpydoc validation is done externally and is not part of the main test suite. I am not sure why we do that. We should probably run those checks as part of the main test suite to avoid the confusion.
ogrisel
left a comment
LGTM. I think the PR in its current state should cover most useful cases. I did not see any particular defect. Just a small improvement suggestion for one of the exception messages below:
…geron/scikit-learn into function_transformer_feature_names_out
Thanks for reviewing, Olivier. I just made the change you suggested.
In such cases, should I pull and merge `main`?
That would not hurt, and if the PR is "CI green ticked", it might get a better chance to attract reviewers' attention :) |
Thanks @ogrisel , I merged main, now there's a beautiful green tick. 😊 |
thomasjpfan
left a comment
Thanks for the update @ageron !
Thanks for the review. 👍 |
I copied your function into my scikit-learn environment and tried to use it. However, I still get the error below, where `preprocessor` is my `ColumnTransformer` and I run the following code:
preprocessor.get_feature_names_out()
Transformer argument looks like this:
('log', FunctionTransformer(np.log1p, validate=True), log_features)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [10], in <cell line: 3>()
1 xt = preprocessor.transform(X_test)
2 #mapie.single_estimator_[1].estimator
----> 3 preprocessor.get_feature_names_out()
File ~\miniconda3\envs\Master_ML\lib\site-packages\sklearn\compose\_column_transformer.py:481, in ColumnTransformer.get_feature_names_out(self, input_features)
479 transformer_with_feature_names_out = []
480 for name, trans, column, _ in self._iter(fitted=True):
--> 481 feature_names_out = self._get_feature_name_out_for_transformer(
482 name, trans, column, input_features
483 )
484 if feature_names_out is None:
485 continue
File ~\miniconda3\envs\Master_ML\lib\site-packages\sklearn\compose\_column_transformer.py:446, in ColumnTransformer._get_feature_name_out_for_transformer(self, name, trans, column, feature_names_in)
444 # An actual transformer
445 if not hasattr(trans, "get_feature_names_out"):
--> 446 raise AttributeError(
447 f"Transformer {name} (type {type(trans).__name__}) does "
448 "not provide get_feature_names_out."
449 )
450 if isinstance(column, Iterable) and not all(
451 isinstance(col, str) for col in column
452 ):
453 column = _safe_indexing(feature_names_in, column)
AttributeError: Transformer log (type FunctionTransformer) does not provide get_feature_names_out.
This feature is not released yet and will be released in v1.1. If you want to try out the feature now, you can install the nightly build: `pip install --pre --extra-index https://pypi.anaconda.org/scipy-wheels-nightly/simple scikit-learn`
Are you sure this is working as intended? I just installed the nightly build and I still get exactly this error. The code is in my environment; at least function_transformer_.py has this method implemented.
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
mean_transformer = FunctionTransformer(
func=np.log1p,
feature_names_out="one-to-one",
validate=True
)
X = pd.DataFrame({"my_feature": [1, 2, 3]})
X_trans = mean_transformer.fit_transform(X)
print(mean_transformer.get_feature_names_out())
# ['my_feature']
Thank you Thomas ... sorry for asking all these questions that might be totally obvious :(

Reference Issues/PRs
Follow-up on #18444.
Part of #21308.
This new feature was discussed in #21079.
What does this implement/fix? Explain your changes.
Adds the `get_feature_names_out` method and a new parameter `feature_names_out` to `preprocessing.FunctionTransformer`. By default, `get_feature_names_out` returns the input feature names, but you can set `feature_names_out` to return a different list, which is especially useful when the number of output features differs from the number of input features.

For example, here's a `FunctionTransformer` that outputs a single feature, equal to the input's mean along axis=1:
The `feature_names_out` parameter may also be a callable. This is useful if the output feature names depend on the input feature names, and/or if they depend on parameters like `kw_args`. Here's an example that uses both: a transformer that appends `n` random features to the existing features.

Any other comments?
I have some concerns regarding the fact that `validate` is `False` by default, which means that `n_features_in_` and `feature_names_in_` are not set automatically. So if you create a `FunctionTransformer` with the default `validate=False` and `feature_names_out=None`, then when you call `get_feature_names_out` without any argument, it will raise an exception (unless `transform` was called before and `func` set `n_features_in_` or `feature_names_in_`). I tried to make this clear in the error message, but I'm worried that this will confuse users. Wdyt?

And if `validate=False` and you set `feature_names_out` to a callable, and call `get_feature_names_out` with no arguments, then the callable will get `input_features=None` as input (unless `transform` was called before and `func` set `n_features_in_` or `feature_names_in_`). Users may be surprised by this. Should we output a warning in this case? Wdyt?

Moreover, as shown in the second code example above, the output feature names may depend on `kw_args`, so if `feature_names_out` is a callable, `get_feature_names_out` passes `self` to it, plus the `input_features`. I considered checking `feature_names_out.__code__.co_varnames` to decide whether to pass no arguments, just the `input_features`, or the `input_features` and `self`. But `__code__` is not used anywhere in the code base, and `inspect` is not used much, so I'm not sure whether such introspection would be frowned upon. I decided that it was simple enough to require users to always have two arguments: the transformer itself, and the `input_features`. Wdyt?

Lastly, when users want to create a `FunctionTransformer` that outputs a single feature, I expect that many will be tempted to set `feature_names_out` to a string instead of a list. To keep things consistent, I decided to raise an exception in this case, with a clear error message telling them to use `["foo"]` instead. Wdyt?