VOTE: SLEP006 - Metadata Routing #65
I think @glemaitre, @thomasjpfan, and I are a +1 on this, so it'd be nice to hear from others.
Here's a list of steps I think we need to take for this work to be merged with
Our plan is to hopefully have this feature in 1.1, which we should be releasing in late April/early May. Therefore it'd be nice to have the vote and the review of the first PR done ASAP.
+1 I don't see a better way. A great thanks to @adrinjalali for all the work! I'm excited. Some notes:
@lorentzenchr column names are usually passed as a constructor parameter, but one could pass them as metadata. But for your suggestion to work, we need
My only remark at this point is that it should be explicit in the SLEP that my 2c
Could we handle this pragmatically, i.e. vote in favor of this PR and then have a different discussion/vote on a name change (props or metadata)?
That parameter is called
agramfort left a comment:
I am happy now :)
Through extensive discussions, we've previously resolved to use
Is there a good way to comment on specific parts of the SLEP? Or is it too late for that now anyway? This is all about row-wise (per-sample) metadata, not arbitrary metadata, right? Or are we talking about arbitrary metadata?
This seems to me like it should be explained a bit more in the SLEP?
Answers to @amueller's questions:
1. You could leave comments the way you did here.
2. No. That would be very backward incompatible. We accept arbitrary metadata in
3. We could add an example. But I'm not sure if this is an important part of the SLEP?
4. We do say that the only thing requested by default is
5. Since we want the code to be explicit, and have no silent logic change when models start supporting metadata, nothing (except
6. Yes.
Ok thanks, that answers my question. I don't see the "nothing is requested by default except groups" in the SLEP, but maybe I'm overlooking it?

> We accept arbitrary

How would it be backward incompatible to not support it in a new feature? And what are the current scenarios that we are supporting? I guess implementation-wise it's not that big of a difference, but it would allow us to know ahead of time what we need to slice in cross-validation, and it would make for a more straightforward mental model. If we want to explicitly allow arbitrary metadata and leave the door open for per-column metadata, we should say that explicitly, and motivate it explicitly, I think.
I assume a meta-estimator/pipeline requests a particular piece of metadata iff any of its sub-estimators requests that metadata, and so, to avoid an error, some component of the meta-estimator or pipeline needs to use each piece of metadata passed in?
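The aggregation rule described in this question can be sketched in plain Python. Everything here is illustrative (the `SubEstimator`/`Router` names are made up and this is not scikit-learn's API): a meta-estimator's request set is the union of its children's requests, and passing metadata that nobody requested is an error.

```python
class SubEstimator:
    """Toy consumer: declares which metadata keys its fit wants."""

    def __init__(self, name, requests):
        self.name = name
        self.requests = set(requests)

    def fit(self, X, y, **metadata):
        # a consumer only ever receives the metadata it asked for
        assert set(metadata) <= self.requests
        return self


class Router:
    """Toy meta-estimator: requests exactly what its sub-estimators request."""

    def __init__(self, estimators):
        self.estimators = estimators

    @property
    def requests(self):
        out = set()
        for est in self.estimators:
            out |= est.requests
        return out

    def fit(self, X, y, **metadata):
        # metadata that no sub-estimator requested is rejected loudly
        unrequested = set(metadata) - self.requests
        if unrequested:
            raise TypeError(f"Unrequested metadata: {sorted(unrequested)}")
        for est in self.estimators:
            est.fit(X, y, **{k: v for k, v in metadata.items()
                             if k in est.requests})
        return self


router = Router([SubEstimator("clf", {"sample_weight"}),
                 SubEstimator("trs", {"groups"})])
assert router.requests == {"sample_weight", "groups"}
```

So in this sketch, `router.fit(X, y, sample_weight=w, groups=g)` is fine, while an unknown keyword raises, matching the "some component needs to use each piece of metadata" reading.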
Setting aside the implementation, I think my main concern is whether we want to restrict this to sample-aligned metadata, because I think using array shape to decide when to slice something is evil. Also, I don't like adding so many methods to the top-level object namespace, in particular methods without a common prefix that share a prefix with the most commonly used methods, and that are relevant only to a very small portion of users. Though there could be some fixes for that which don't require changes to the essence of the SLEP (two I can think of are adding a
This is only true in the core library, because of the distinction between Group and non-Group splitters, where in the former
Ok, I was more talking about saying "in the core library nothing [but groups] is requested by default" explicitly in the doc.
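The "in the core library nothing but groups is requested by default" rule can be illustrated with a toy sketch (class and method names here are hypothetical stand-ins, not scikit-learn code): ordinary splitters request no metadata by default, while Group splitters request `groups`, since they cannot split at all without it.

```python
class KFoldLike:
    """Toy non-Group splitter: requests nothing by default."""

    def get_metadata_request(self):
        return {"split": {}}


class GroupKFoldLike:
    """Toy Group splitter: groups is required for splitting to work."""

    def get_metadata_request(self):
        return {"split": {"groups": True}}


assert KFoldLike().get_metadata_request()["split"] == {}
assert GroupKFoldLike().get_metadata_request()["split"] == {"groups": True}
```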
In most places where we have a meta-estimator accepting and forwarding

However, I do agree we should be able to be explicit about what metadata is sample metadata, what's column metadata, and what's data(set) metadata, and that's not hard to add to the existing API. For instance, you could think of:

```python
est = MyEstimator().fit_requests(sensitive_attribute=True, sensitive_attribute_type="row")
# or
est = MyEstimator().fit_requests(sensitive_attribute=MetadataInfo(request=True, type="row"))
```

or a variation of the above. It's not too hard to extend the proposed API to include this information, and I'm happy to have that implemented once the API proposed by this SLEP is implemented. Adding that wouldn't even need a SLEP IMO. But adding it here makes the SLEP and the implementation unnecessarily large, and they're already large enough that we're having a hard time getting them accepted or merged.
We already do, this SLEP is not about that. It's about what we do in
Yes, clarified in the SLEP.
I guess there's gonna be something in whatever solution we pick that somebody doesn't like 😁. We've gone through a few iterations on the method(s) exposed to the user to set the metadata requests.
If we add an accessor or a prefix to the request methods, it makes the current, arguably verbose, code even more verbose:

```python
est = SimplePipeline(
    transformer=ExampleTransformer()
    # we want transformer's fit to receive sample_weight
    .fit_requests(sample_weight=True)
    # we want transformer's transform to receive groups
    .transform_requests(groups=True),
    classifier=RouterConsumerClassifier(
        estimator=ExampleClassifier()
        # we want this sub-estimator to receive sample_weight in fit
        .fit_requests(sample_weight=True)
        # but not groups in predict
        .predict_requests(groups=False),
    ).fit_requests(
        # and we want the meta-estimator to receive sample_weight as well
        sample_weight=True
    ),
)
```

would become:

```python
est = SimplePipeline(
    transformer=ExampleTransformer()
    # we want transformer's fit to receive sample_weight
    .metadata.fit_requests(sample_weight=True)
    # we want transformer's transform to receive groups
    .metadata.transform_requests(groups=True),
    classifier=RouterConsumerClassifier(
        estimator=ExampleClassifier()
        # we want this sub-estimator to receive sample_weight in fit
        .metadata.fit_requests(sample_weight=True)
        # but not groups in predict
        .metadata.predict_requests(groups=False),
    ).metadata.fit_requests(
        # and we want the meta-estimator to receive sample_weight as well
        sample_weight=True
    ),
)
```

Also, I don't agree with you that having fewer top-level methods should be a motivation for setting the API here. I would agree if we had tons of methods, but we don't; our API doesn't have many top-level methods. If anything, having the request methods top-level and prefixed with the name of the corresponding method makes it easy for users to be reminded by their IDE's auto-complete that there's metadata passed to the method. And the majority of users wouldn't need to change any code or write any new code anyway. One thing which may make you happy is that the
Done.
Re @adrinjalali's suggestion of passing

In short, the SLEP will handle metadata that does not require splitting in cross-validation beautifully, as long as the cross-val estimators become aware of how these requests will be marked. I will admit I find
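The distinction raised earlier between metadata that must be sliced per fold and metadata that must not be (and the objection that inferring this from array shape is evil) can be sketched as follows. This is a hypothetical helper, not part of the SLEP or scikit-learn: the slicing decision comes from an explicit per-key declaration, never from `len()` or shape.

```python
def slice_metadata(metadata, types, indices):
    """Restrict sample-aligned metadata to `indices`; pass the rest through.

    `types` explicitly maps each key to "row" (per-sample, sliced with the
    fold) or "dataset" (fold-independent, forwarded unchanged).
    """
    out = {}
    for key, value in metadata.items():
        if types[key] == "row":
            out[key] = [value[i] for i in indices]
        else:
            out[key] = value
    return out


meta = {"sample_weight": [1.0, 2.0, 3.0, 4.0], "feature_names": ["a", "b"]}
types = {"sample_weight": "row", "feature_names": "dataset"}
train_idx = [0, 2]
sliced = slice_metadata(meta, types, train_idx)
assert sliced == {"sample_weight": [1.0, 3.0], "feature_names": ["a", "b"]}
```

Note that a shape-based heuristic would mis-slice `feature_names` whenever its length happened to equal the number of samples; the explicit declaration avoids that trap.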
I love this design, and it is easy to implement. But I would rather not have it in the initial PR, and add it later (before merging into
I don't think this is a controversial change (although I personally really prefer
Sorry I know I'm late to the game, can you point me towards this agreement? I'm not convinced by "we do the wrong thing for fit right now, we should do the same wrong thing for everything else", because despite what you say, that's really hard to change later and would be an incompatible change.
I would argue exactly the other way around. These are methods that 95% of our users won't need. But every time I autocomplete … I'm not teaching many workshops right now, but I can so imagine people asking me "what does this
@amueller the last two commits to the SLEP change the text regarding both your latest points, does that mean you're now a +1? 😁
FWIW I think having a method per metadata-accepting method is the least ugly design ;)
Lol does the renaming invalidate previous votes? Ok, I'm happy now. I'd really love there to be validation of metadata, but we can discuss this after this SLEP is accepted, maybe? I definitely agree with the rest of the overall design.
glemaitre left a comment:
I see that I never voted. But based on the discussions that we had before proposing this SLEP to vote, I am +1.
That's what I thought too, and that's why I had it that way at first. But now that I've worked with the other way around for a while, I find it quite neat actually.
I expect people to raise their voices if they disagree with the changes. I expect the changes not to be too controversial, and the ones introduced in the process of this vote have arguably improved the SLEP.
What sort of validation? We do some, but I'm not sure what you mean.
Wait I thought I agreed with the current design? That was at least my intention. Right now it's one method to set the meta-data per accepting method, right?
Validating that it's per-row.
Yes, I had parsed your sentence wrong. We're all on the same page now :D
I don't think that needs a SLEP. I very much agree we should do it, and this SLEP allows us to implement the validation once metadata routing is implemented. I think it's a relatively minor issue, since we probably don't want to do the validation everywhere, probably only in
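The per-row validation being discussed could look roughly like this sketch (the helper name and signature are made up for illustration, not proposed API): once a metadata key is declared per-sample, checking it is just a length comparison against the number of samples.

```python
def check_row_aligned(name, value, n_samples):
    """Raise if per-sample metadata does not have one entry per sample."""
    n = len(value)
    if n != n_samples:
        raise ValueError(
            f"{name} is requested as per-sample metadata but has length "
            f"{n}, while X has {n_samples} samples."
        )
    return value


# one weight per sample: passes through unchanged
check_row_aligned("sample_weight", [0.5, 1.5, 1.0], n_samples=3)
```

A mismatched length (e.g. two weights for three samples) would raise a `ValueError` with an explicit message, rather than silently mis-slicing later.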
jeremiedbb left a comment:
First, thanks for all the links to already rendered docs. It makes it much easier to review!
I find the proposed API quite satisfying given its purpose. It's not overly complicated and verbose from the user point of view, which is one of the main goals imo.
Most of my concerns were answered in the previous comments. I only have one concern left, regarding verbosity. I'm afraid that we end up having to do something like the following all over the place:
```python
log_reg = LogisticRegression().fit_requests(
    sample_weight=True
).predict_requests(
    sample_weight=True
).predict_proba_requests(
    sample_weight=True
).score_requests(
    sample_weight=True
)
```

Not sure we'd ever want to request it for all methods though :)
Have you considered having a single requests method that accepts a dict describing the map "method" <-> "metadata"? It would make it easier to have a reusable dict for different models that you pass around, or other stuff like that.
Anyway it's probably more like an implementation detail which should be discussed in the implem PR instead of in the SLEP.
@jeremiedbb that's how it was at the beginning, but there's no good way to have a good signature there. Also, your example doesn't happen, since
@jeremiedbb, does
I prefer the current state, not a fan of the dunder :)

```python
set_request({
    "fit": {"sample_weight": True},
    "score": {"sample_weight": True}
})
```

But it's not ideal either, and I can see that it might not necessarily be more understandable or recallable for the user. And anyway, as pointed out, metadata will almost always be requested by 1 or 2 methods only, so it's kind of a detail.
That's pretty literally what we had in an early iteration, and we ended up moving away from it since it's not easy for users to remember how to write it down. Especially when you have an alias, it becomes:

```python
set_request({
    "fit": {"sample_weight": "first_weights"},
    "score": {"sample_weight": "second_weights"}
})
```

And it's confusing if the dictionary is
This is very exciting. Counting the votes, we have 8 (voted here plus myself) in favor, and none against. A nice pass, and thanks to everybody involved. Still, reviewing scikit-learn/scikit-learn#22083 is on the table ;)
Congrats @adrinjalali. Thank you for pushing this through! |
This PR is for us to discuss and collect votes for SLEP006 - Metadata Routing (was sample props).
A rendered version of the SLEP is available here, and detailed past discussions can be found under these PRs and these issues.
The current proposed implementation is drafted under scikit-learn/scikit-learn#22083, where you can find a rendered version of the user guide here and a rendered version of the developer API (only applicable to third-party developers and people who write custom estimators) here. These are found under `metadata_routing.rst` and `plot_metadata_routing.py` under the aforementioned PR, respectively. Note that this PR does NOT touch scorers, splitters, or any of the estimators or meta-estimators in scikit-learn. It implements the machinery in `BaseEstimator` for consumers only. The PR is also not targeted to `main`; instead it's to be merged into the `sample-props` branch on the main repo, with follow-up PRs to complete the implementation before merging into `main`.

Please leave your votes here, and comment here, on the mailing list, or open new PRs/issues in this repo or the main repo for specifics if you think further discussion is required.
Note that this vote is not to merge the implementation PR; it is rather to accept the SLEP, and the SLEP does NOT discuss implementation details or the API for us and third-party developers. But we're more than happy to discuss the implementation and the API during this voting round.
We also plan to send out a survey asking third-party developers for their opinion on our proposed developer API, in parallel to this call for votes.
This vote will close on February 21, 2022.