ENH Add get_feature_names_out for RandomTreesEmbedding module#21762
ENH Add get_feature_names_out for RandomTreesEmbedding module#21762jeremiedbb merged 23 commits intoscikit-learn:mainfrom
Conversation
thomasjpfan
left a comment
There was a problem hiding this comment.
Thank you for the PR @MaxwellLZH ! I recommend breaking this PR into 3 PRs:
- Keep this one for
RandomTreesEmbedding. - Another for
Voting* - Another for
Stacking*(Do not start this one untilVoting*is complete since they are related and will have the same discussions)
This is because I think there is an argument for generating more descriptive names for each of the cases above.
| assert_array_equal( | ||
| [f"randomtreesembedding{i}" for i in range(hasher._n_features_out)], names | ||
| ) |
There was a problem hiding this comment.
I think it is better to explicitly test the public API. We can transform and get the number of features out:
n_features_out = hasher.transform(X).shape[1]
assert_array_equal(
[f"randomtreesembedding{i}" for i in range(n_features_out)], names
)There is an argument for using something like randomtreesembedding_3_10, where 3 is represents the tree that used to generate the leaf, and 10 is the leaf index.
There was a problem hiding this comment.
Shall I leave the naming as it is for now? if we decided to go for randomtreesembedding_{i}_{j} then we can change the test cases accordingly later?
There was a problem hiding this comment.
We need to make a decision in this PR. I like using randomtreesembedding_{i}_{j}, can we update this PR to use this formatting and see what other reviewers think?
This means a custom get_feature_names_out for tree embedding.
There was a problem hiding this comment.
I've added a custom get_feature_names_out for tree embedding, where i is tree index starting from 1 and j is leaf index as suggested.
Co-authored-by: Thomas J. Fan <[email protected]>
Co-authored-by: Thomas J. Fan <[email protected]>
thomasjpfan
left a comment
There was a problem hiding this comment.
Please add an entry to the change log at doc/whats_new/v1.1.rst with tag |Enhacement|. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
Co-authored-by: Thomas J. Fan <[email protected]>
Co-authored-by: Thomas J. Fan <[email protected]>
Co-authored-by: Thomas J. Fan <[email protected]>
Co-authored-by: Thomas J. Fan <[email protected]>
Co-authored-by: Thomas J. Fan <[email protected]>
Co-authored-by: Thomas J. Fan <[email protected]>
thomasjpfan
left a comment
There was a problem hiding this comment.
Thanks for the update! I am happy with the naming for the features here. Let's see what other reviewers think.
ogrisel
left a comment
There was a problem hiding this comment.
LGTM once the suggestions below as taken into account. I found the feature names surprising so I think it's was necessary to make the docstring more explicit and add an inline comment in the test.
Do you think it is better to ignore the internal indices from the trees and use 0, 1, 2, etc for the leaf indices? I am okay with that option as well. |
The current implementation is more informative but potentially a bit confusing. Using contiguous, leaf-only indexing would make the feature names less dependent on the internal tree data-structure but in a way this internal detail is already part of the public API because those are the indices returned by the So +0 for keeping the current indexing / naming scheme. |
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
I don't have a strong preference. I'm fine with reflecting the tree structure with the additional comments from Olivier. It might make it easier to debug if needed as well. |
…-learn#21762) Co-authored-by: Thomas J. Fan <[email protected]> Co-authored-by: Jérémie du Boisberranger <[email protected]> Co-authored-by: Olivier Grisel <[email protected]>
Reference Issues/PRs
Part of issue #21308 . This is the same PR as #21459 (I got into some git issue with the last PR )
What does this implement/fix? Explain your changes.
Implementing
get_feature_names_outforRandomForestEmbedding,VotingClassifier,VotingRegressor,StackingClassifierandStackingRegressor, with corresponding test cases.