ENH Introduces set_output API for pandas output #23734

thomasjpfan · 2022-06-23T00:23:14Z

Reference Issues/PRs

Closes #23001
Implements SLEP018: scikit-learn/enhancement_proposals#68

What does this implement/fix? Explain your changes.

This PR introduces:

set_output for Pipeline and transformers in preprocessing.
Common test to check the behavior of set_output, which will be used for follow-up PRs that add set_output to all other transformers.
OutputTypeMixin, where most transformers only need to subclass it to get the set_output API.
Global configuration option set_config(output_transform="pandas") to set the output globally.

sklearn/utils/set_output.py

ogrisel · 2022-06-27T17:00:15Z

sklearn/utils/set_output.py

+            output.columns = columns
+        if index is not None:
+            output.index = index
+        return output


Is this if-block ever used in our current scikit-learn transformers? Or is it just to add some compat with third-party estimators that output dataframes by default in case they plan to inherit from the SetOutputMixin?

One possible use-case I would see would be to enforce that the output dataframe of a FunctionTransformer that naturally generates a dataframe has consistent column names with what is returned by its get_feature_names_out method. However FunctionTransformer does not inherit from the SetOutputMixin class at the moment so it does not apply.

But I am wondering, for such transformers that naturally output dataframes, should it be the responsibility of the implementer of that transformer to ensure consistency with get_feature_names_out?

should it be the responsibility of the implementer of that transformer to ensure consistency with get_feature_names_out

I'd prefer get_feature_names_out to already be consistent when it gets to wrapping the container. In that case, we can have SetOutputMixin be a no-op if the return of transform is already a DataFrame.

Or is it just to add some compat with third-party estimators that output dataframes by default in case they plan to inherit from the SetOutputMixin?

This if-block is for compat, but it's also to make sense of the API. If the output is already a dataframe and columns is passed in, then I thought a reasonable thing to do is to set the columns as well. Given the point above, I am okay with doing a no-op if the container is already a DataFrame.

I updated the PR to make it a noop when the original_data is a pandas dataframe. This way SetOutputMixin does not do anything if estimator.transform already returns a DataFrame.

Is this resolved?

I think there is still a lingering issue with FunctionTransformer.

This PR has been updated so that TransformerMixin inherits from SetOutputMixin. This means FunctionTransformer will get the set_output API if feature_names_out is defined. To be concrete, here is the current behavior for FunctionTransformer:

1. If feature_names_out is defined, then set_output is available. The output columns will always be consistent with get_feature_names_out, whether func returns a dataframe or an ndarray.

2. If feature_names_out is not defined, then set_output is not available. func is allowed to return a dataframe or an ndarray. Not having set_output is inconvenient, specifically when func returns a dataframe. For example, if the FunctionTransformer without set_output is in a complex pipeline, then pipeline.set_output(transform="pandas") would fail to configure FunctionTransformer.

A workaround for 2 is to have a custom set_output raise a warning when feature_names_out is not defined. The warning states that func must return a dataframe if set_output(transform="pandas"). Technically, we can "learn" the feature names out by running func in fit, but that introduces more computation compared to main.

A workaround for 2 is to have a custom set_output raise a warning when feature_names_out is not defined. The warning states that func must return a dataframe if set_output(transform="pandas").

Sounds reasonable. We can also raise a TypeError at transform time if set_output(transform="pandas") was previously called and func does not naturally return a pandas dataframe.
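The workaround discussed in this thread could look roughly like the sketch below. It is not scikit-learn's FunctionTransformer: SketchFunctionTransformer and looks_like_dataframe are hypothetical names, and a duck-typed check on columns/index attributes stands in for a real pandas check so that the snippet needs no third-party packages.

```python
import warnings


def looks_like_dataframe(obj):
    # Hypothetical stand-in for an isinstance(obj, pandas.DataFrame) check.
    return hasattr(obj, "columns") and hasattr(obj, "index")


class SketchFunctionTransformer:
    """Illustrative only; not the scikit-learn implementation."""

    def __init__(self, func, feature_names_out=None):
        self.func = func
        self.feature_names_out = feature_names_out
        self._transform_output = None  # nothing configured yet

    def set_output(self, *, transform=None):
        if transform == "pandas" and self.feature_names_out is None:
            # Warn early: the names cannot be inferred, so func must
            # produce the dataframe itself.
            warnings.warn(
                "feature_names_out is not defined; func must return a "
                "dataframe when transform='pandas'.",
                UserWarning,
            )
        if transform is not None:
            self._transform_output = transform
        return self

    def transform(self, X):
        out = self.func(X)
        if (
            self._transform_output == "pandas"
            and self.feature_names_out is None
            and not looks_like_dataframe(out)
        ):
            # Fail at transform time if func did not return a dataframe.
            raise TypeError("func did not return a dataframe")
        return out
```

With this shape, a pipeline-wide set_output call can still reach the transformer, and the mismatch only surfaces when func actually returns something other than a dataframe.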

sklearn/utils/set_output.py

amueller

I think we should move the SetOutputMixin into the TransformerMixin so we don't have to add it everywhere manually and to keep the inheritance list more reasonable.
That should work in theory, though things that output sparse will error on transform if someone did set_output on them. It might be nicer to error earlier, but we can still do that after merging the initial thing.

doc/whats_new/v1.2.rst

sklearn/_config.py

sklearn/preprocessing/_discretization.py

amueller · 2022-07-17T19:20:55Z

Ok so the problem with inheriting from TransformerMixin is that it's hard to override the behavior implemented in the mixin.
With the implementation in this PR, transformers can decide not to inherit from SetOutputMixin and implement transform and fit_transform themselves to support set_output.

If we add SetOutputMixin to TransformerMixin everybody inheriting from TransformerMixin will automatically get SetOutputMixin but it's harder to customize.
There would be two ways to allow customization: either by making _wrap_method_output a method that someone could override, or by adding a **kwargs to __init_subclass__ which would allow someone to disable the automatic wrapping behavior.
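The second option, a keyword accepted by __init_subclass__, can be sketched as follows. This is a simplified illustration, not the code in this PR: SketchSetOutputMixin, Doubler, RawDoubler and the "tagged" container are hypothetical, and the keyword name auto_wrap_output_keys is borrowed from later in this conversation.

```python
from functools import wraps


def _wrap_method_output(method):
    """Wrap a method so its result is post-processed when configured."""

    @wraps(method)
    def wrapped(self, X, *args, **kwargs):
        result = method(self, X, *args, **kwargs)
        if getattr(self, "_output_config", None) == "tagged":
            # Stand-in for wrapping the result in a pandas container.
            return {"container": "tagged", "data": result}
        return result

    return wrapped


class SketchSetOutputMixin:
    """Illustrative only; mirrors the idea, not scikit-learn's code."""

    def __init_subclass__(cls, auto_wrap_output_keys=("transform",), **kwargs):
        super().__init_subclass__(**kwargs)
        if auto_wrap_output_keys is None:
            return  # the subclass opted out of automatic wrapping
        for name in ("transform", "fit_transform"):
            # Only wrap methods defined directly on this subclass.
            if name in cls.__dict__:
                setattr(cls, name, _wrap_method_output(cls.__dict__[name]))

    def set_output(self, *, transform=None):
        if transform is not None:
            self._output_config = transform
        return self


class Doubler(SketchSetOutputMixin):
    def transform(self, X):
        return [2 * x for x in X]


class RawDoubler(SketchSetOutputMixin, auto_wrap_output_keys=None):
    def transform(self, X):
        return [2 * x for x in X]
```

Here Doubler gets wrapped automatically at class-creation time, while RawDoubler keeps its transform untouched and would have to honor set_output on its own.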

Let the bikeshedding begin!

doc/developers/develop.rst

thomasjpfan · 2022-10-07T22:03:41Z

How difficult/easy would it be to extend the set_output options to, e.g., "arrow"?

I see two possible paths: Support it directly in scikit-learn or an API to configure "arrow".

Add another function _wrap_in_arrow_container and dispatch to it in _wrap_data_with_container when set_output(transform="arrow").
Allow set_output(transform=callable), where a user defines a callable that constructs an arrow container. This will likely have the same signature as _wrap_in_pandas_container: (data, columns, index).
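Both paths can be pictured with one small dispatch function. The sketch below is hypothetical and uses plain Python containers instead of pandas or arrow: _wrap_in_rows_container, _CONTAINERS and wrap_data_with_container are made-up names, and only the (data, columns, index) signature comes from the discussion.

```python
def _wrap_in_rows_container(data, columns, index):
    # Hypothetical stand-in for _wrap_in_pandas_container: a dict of
    # labelled rows instead of a DataFrame.
    return {i: dict(zip(columns, row)) for i, row in zip(index, data)}


# Path 1: built-in containers looked up by name.
_CONTAINERS = {"rows": _wrap_in_rows_container}


def wrap_data_with_container(data, columns, index, transform="default"):
    if transform == "default":
        return data  # leave the transformer's output untouched
    if callable(transform):
        # Path 2: user-supplied constructor with the same signature.
        return transform(data, columns, index)
    try:
        return _CONTAINERS[transform](data, columns, index)
    except KeyError:
        raise ValueError(f"unknown output container: {transform!r}") from None
```

Supporting a new container directly would mean adding one entry to the registry, whereas the callable path moves that responsibility to the user.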

What happens with 3rd party transformers in a pipeline if global set_config(transform_output="pandas") has been set?

If they inherit from TransformerMixin, then they will auto opt-into set_output and transform will output DataFrames. They can opt-out by setting auto_wrap_output_keys=None.

If set_config(transform_output="pandas") has been set, and a model is saved via pickle and later loaded in a different Python process, will it output pandas?

If it's a global set_config, then the global option needs to be set again to output pandas. If it's the local option, i.e. transform.set_output, then transform outputs pandas.
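The difference between the two options comes down to where the setting is stored. The following sketch assumes a module-level dictionary for the global option and an instance attribute for the local one. All names (_global_config, SketchTransformer, _output_config, resolved_output) are hypothetical. Resetting the global dictionary stands in for starting a fresh process.

```python
_global_config = {"transform_output": "default"}  # hypothetical module state


def set_config(*, transform_output):
    _global_config["transform_output"] = transform_output


class SketchTransformer:
    def set_output(self, *, transform=None):
        if transform is not None:
            # Stored on the instance, so it travels with the object's state.
            self._output_config = {"transform": transform}
        return self

    def resolved_output(self):
        local = getattr(self, "_output_config", {})
        # A local setting wins; otherwise fall back to the global option.
        return local.get("transform", _global_config["transform_output"])
```

Because pickle serializes instance attributes but not module-level state, only the locally configured object would keep producing pandas output after being reloaded elsewhere.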

In the default setting, what happens with other Array API compliant data containers than numpy, e.g. cupy, see #22352?

The default setting will keep the behavior on main. The transformer decides what it wants to output, ndarray, cupy, dataframes, etc.

lorentzenchr · 2022-10-09T14:30:24Z

What happens with 3rd party transformers in a pipeline if global set_config(transform_output="pandas") has been set?

If they inherit from TransformerMixin, then they will auto opt-into set_output and transform will output DataFrames. They can opt-out by setting auto_wrap_output_keys=None.

What if they do not inherit from TransformerMixin? Or stated otherwise: Are we clear enough on the expected API for a 3rd party transformer (that - for some reasons - does not want to depend on scikit-learn)?

lorentzenchr · 2022-10-09T14:33:47Z

@thomasjpfan The only open point for me is the name of the default option of transform_output, see #23734 (comment). Your suggestion to use None instead of "default" is good for me. As this would be a deviation from the SLEP, other opinions are highly appreciated.

Once that is settled, I will give my review approval.

thomasjpfan · 2022-10-09T19:01:06Z

What if they do not inherit from TransformerMixin? Or stated otherwise: Are we clear enough on the expected API for a 3rd party transformer (that - for some reasons - does not want to depend on scikit-learn)?

For now, I prefer not to require the API from third-party estimators. Our meta-estimators such as ColumnTransformer will complain when transformers do not define set_output. As for the global option, I do not think we can require third parties to respect it.

Adopting the set_output API fully requires at least a soft dependency on scikit-learn and pandas:

scikit-learn: To follow scikit-learn's global config, one needs to get the global config from scikit-learn.
pandas: Outputting DataFrames requires it.

Co-authored-by: Christian Lorentzen <[email protected]>

glemaitre

I am a bit split regarding the choice between None, "default" (why not "native"). I find that None does not specify the output type but this is also the case for "default" or "native". It means that the output type should be provided somewhere else (I assume the documentation).

I am therefore fine with any option.

Regarding the other changes. LGTM.

thomasjpfan · 2022-10-11T21:46:52Z

Looking through the code, we cannot use None instead of "default". Currently, the signature is set_output(self, transform=None), where None is a sentinel meaning "do not configure transform". This is required if we have set_output(predict="pandas"), which configures the container for predict but leaves transform alone. Secondly, est.set_output() with no input configures nothing. I can use another sentinel object, but that complicates the set_output API.
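The sentinel role of None can be illustrated with a minimal sketch. SketchEstimator is hypothetical, and the predict keyword is only the possible future extension mentioned above, not something this PR adds.

```python
class SketchEstimator:
    """Illustrative only; shows why None cannot double as an output type."""

    def __init__(self):
        self._output_config = {}

    def set_output(self, *, transform=None, predict=None):
        # None means "leave this method's configuration alone".
        for method, value in (("transform", transform), ("predict", predict)):
            if value is not None:
                self._output_config[method] = value
        return self
```

If None were also the name of the default container, set_output(predict="pandas") could no longer be told apart from a request to reset transform, which is why a string value is needed.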

I think the best option is to find a string that best describes the behavior. I think "native" is okay. Another option is "any".

@lorentzenchr WDYT?

lorentzenchr · 2022-10-11T21:49:06Z

Yes "native" is fine, even good! I'm happy with any other default value than "default", which I would consider bad design because it does not tell anything about the actual behaviour.

thomasjpfan · 2022-10-11T22:33:02Z

I updated this PR to use "native". I also opened a PR to update SLEP: scikit-learn/enhancement_proposals#78

lorentzenchr

LGTM
@thomasjpfan A great thank you for all your (years long) efforts to make this happen!

This reverts commit 5313958.

lorentzenchr · 2022-10-12T22:11:50Z

Merging. If #78 concludes that the default value should still change, we can do that in a new PR.

amueller · 2022-10-14T17:29:33Z

Yaaaaay!!

* Introduces set_output API for all transformers
* TransformerMixin inherits from _SetOutputMixin
* Adds tests
* Adds whatsnew
* Adds example on using set_output API
* Adds developer docs for set_output

thomasjpfan added 7 commits June 22, 2022 14:52

ENH Introduces set_output API

e1ea0a9

CLN Reduces API surface

07078a1

ENH Expand test for failing case

1faf347

CLN Rename mixin

9f9680a

ENH Add full support for get_output in preprocessing

a6a4b59

DOC Adds comment in clone

4ae72c5

DOC Adds whats new number

beca084

github-actions bot added module:pipeline module:preprocessing module:utils labels Jun 23, 2022

thomasjpfan added 2 commits June 22, 2022 20:27

CLN Less diff

021d36c

ENH Use keyword only arguments for public API

de0db34

ogrisel reviewed Jun 28, 2022

thomasjpfan added 10 commits July 4, 2022 19:02

CLN Address comments

63c4204

Merge remote-tracking branch 'upstream/main' into pandas_out_prototyp…

ee4cdff

…e_v3

FIX Fixes typo

9d318b1

CLN Use dictionary instead

609f4f0

CLN Better error message

64c761a

TST Adds more code coverage

471e2d5

CLN Simplifies implementation

63c2011

CLN Remove unneeded parameter

89a854e

CLN Simplify validation

20fed9e

Merge remote-tracking branch 'upstream/main' into pandas_out_prototyp…

91e2448

…e_v3

amueller reviewed Jul 17, 2022

doc/whats_new/v1.2.rst

sklearn/_config.py

sklearn/preprocessing/_discretization.py

thomasjpfan added 3 commits July 17, 2022 10:55

CLN Rename output_transform transform_output

d63f059

CLN Fix name

32d9252

DOC Update whats new

126a9aa

thomasjpfan mentioned this pull request Jul 17, 2022

VOTE SLEP018 - Pandas Output for Transformers scikit-learn/enhancement_proposals#72

Merged

Merge remote-tracking branch 'upstream/main' into pandas_out_prototyp…

0d02e50

…e_v3

thomasjpfan added 2 commits October 7, 2022 17:16

STY Slight formatting

01771ae

TST Adds testing for fit_transform and fit.transform

1421e8a

lorentzenchr reviewed Oct 7, 2022

doc/developers/develop.rst

Update doc/developers/develop.rst

24c3fc1

Co-authored-by: Christian Lorentzen <[email protected]>

glemaitre approved these changes Oct 10, 2022

Merge remote-tracking branch 'upstream/main' into pandas_out_prototyp…

b87ad84

…e_v3

ENH Rename 'default' to 'native'

5313958

thomasjpfan mentioned this pull request Oct 11, 2022

DOC Rename 'default' to 'native' for set_output SLEP scikit-learn/enhancement_proposals#78

Closed

lorentzenchr approved these changes Oct 12, 2022

Revert "ENH Rename 'default' to 'native'"

48add35

This reverts commit 5313958.

lorentzenchr merged commit 2a6703d into scikit-learn:main Oct 12, 2022

lorentzenchr mentioned this pull request Oct 12, 2022

Pandas in, Pandas out? #5523

Closed

This was referenced Oct 14, 2022

ENH ColumnTransformer.transform returns dataframes when transformers output them #20110

Closed

ENH Makes OneToOneFeatureMixin and ClassNamePrefixFeaturesOutMixin public #24688

Merged

avm19 pushed a commit to avm19/scikit-learn that referenced this pull request Oct 18, 2022

Use set_config(transform_output=pandas) and string feature names. See…

33622ae

… SLEP018 and scikit-learn#23734

thomasjpfan mentioned this pull request Oct 18, 2022

ENH Improve set_output compatibility in ColumnTransformer #24699

Merged

lorentzenchr mentioned this pull request Nov 1, 2022

feature names - NamedArray #14315

Closed

ravwojdyla mentioned this pull request Jan 4, 2023

transform_output set in config_context not preserved in the Transformer object? #25287

Closed

popescu-v mentioned this pull request Jul 4, 2023

Implement set_output API in sklearn estimators KhiopsML/khiops-python#43

Open

SuccessMoses mentioned this pull request Nov 29, 2024

FEAT add inverse_transform parameter to _SetOutputMixin.set_output #30376

Open
