-
-
Notifications
You must be signed in to change notification settings - Fork 26.5k
ENH Introduces set_output API for pandas output #23734
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Introduces set_output API for pandas output #23734
Conversation
sklearn/utils/set_output.py
Outdated
| output.columns = columns | ||
| if index is not None: | ||
| output.index = index | ||
| return output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this if-block ever used in our current scikit-learn transformers? Or is it just to add some compat with third-party estimators that output dataframes by default in case they plan to inherit from the SetOutputMixin?
One possible use-case I would see would be to enforce that the output dataframe of a FunctionTransformer that naturally generates a dataframe has consistent column names with what is returned by its get_feature_names_out method. However FunctionTransformer does not inherit from the SetOutputMixin class at the moment so it does not apply.
But I am wondering, for such transformers that naturally outputs dataframe, should it be the responsibility of the implementer of that transformer to ensure consistency with get_feature_names_out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it be the responsibility of the implementer of that transformer to ensure consistency with get_feature_names_out
I'll prefer get_feature_names_out to already be consistent when it gets to wrapping the container. In that case, we can have SetOutputMixin be a no-op if the return of transform is already a DataFrame.
Or is it just to add some compat with third-party estimators that output dataframes by default in case they plan to inherit from the SetOutputMixin?
This if-block is for compat, but it's also to make sense of the API. If the output is already a dataframe and columns is passed in, then I thought a reasonable thing to do is to set the columns as well. Given the point, above I am okay with doing a no-op if the container is already a DataFrame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the PR to make it a noop when the original_data is a pandas dataframe. This way SetOutputMixin does not do anything if estimator.transform already returns a DataFrame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this resolved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is still a lingering issue with FunctionTransformer.
This PR has been updated so that TranformerMixin inherits the SetOutputMixin. This means FunctionTransformer will get the set_output API if feature_names_out is defined. To be concrete, here is the current behavior for FunctionTransformer:
-
If
feature_names_outis defined, thenset_outputis available. The output columns will always be consistent withget_feature_names_outno matter iffuncreturns a dataframe or ndarray. -
If
feature_names_outis not defined thenset_outputis not available.funcis allow to return a dataframe or a ndarray. Not havingset_outputis inconvenient, specifically whenfuncreturns a dataframe. For example, if theFunctionTransformerwithoutset_outputis in complex pipeline, thenpipeline.set_output(transform="pandas")would fail to configureFunctionTransformer.
A workaround for 2 is to have a custom set_output raise a warning when feature_names_out is not defined. The warning states that func must return a dataframe if set_output(transform="pandas"). Technically, we can "learn" the feature names out by running func in fit, but that introduces more computation compared to main.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A workaround for 2 is to have a custom set_output raise a warning when feature_names_out is not defined. The warning states that func must return a dataframe if set_output(transform="pandas").
Sounds reasonable. We can also raise a TypeError at transform time if set_ouput(transform="pandas") was previously called and that func does not naturally return a pandas dataframe.
amueller
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should move the SetOutputMixin into the TransformerMixin so we don't have to add it everywhere manually and to keep the inheritance list more reasonable.
That should work in theory, though things that output sparse will error on transform if someone did set_output on them. It might be nicer to error earlier, but we can still do that after merging the initial thing.
|
Ok so the problem with inheriting from If we add Let the bikeshedding begin! |
I see two possible paths: Support it directly in scikit-learn or an API to configure "arrow".
If they inherit from
If it's a global
The default setting will keep the behavior on |
What if they do not inherit from |
|
@thomasjpfan The only open point for me is the name of the default option of Once that is settled, I will give my review approval. |
For now, I prefer not to require the API from third party estimators. Our meta-estimators such as Adopting the
|
Co-authored-by: Christian Lorentzen <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit split regarding the choice between None, "default" (why not "native"). I find that None does not specify the output type but this is also the case for "default" or "native". It means that the output type should be provided somewhere else (I assume the documentation).
I am therefore fine with any option.
Regarding the other changes. LGTM.
|
Looking through the code, we can not use I think the best option is to find a string that best describes the behavior. I think @lorentzenchr WDYT? |
|
Yes |
|
I updated this PR to use |
lorentzenchr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@thomasjpfan A great thank you for all your (years long) efforts to make this happen!
This reverts commit 5313958.
|
I merge. In case that #78 concludes to still change the default value, we can do that in a new PR. |
|
Yaaaaay!! |
* Introduces set_output API for all transformers * TransformerMixin inherits from _SetOutputMixin * Adds tests * Adds whatsnew * Adds example on using set_output API * Adds developer docs for set_output
Reference Issues/PRs
Closes #23001
Implements SLEP018: scikit-learn/enhancement_proposals#68
What does this implement/fix? Explain your changes.
This PR introduces:
set_outputforPipelineand transformers inpreprocessing.set_output, which will be used for follow up PRs that addsset_outputto all other transformers.OutputTypeMixin, where most transformers only need to subclass it to get theset_outputAPI.set_config(output_transform="pandas")to set the output globally.