ENH Add libsvm-like calibration procedure to CalibratedClassifierCV #17856
Conversation
@NicolasHug, this is probably not going to be a problem but I thought I would mention it. We use ...
NicolasHug left a comment
Thanks @lucyleeow, made a first pass.
I think the refactoring makes sense, WDYT?
I haven't looked in detail at the changes related to the label encoders though. Are these needed here or can they wait for another PR?
sklearn/calibration.py
Outdated
# When binary, proba of clf_fitted.classes_[1]
# output but `pos_class_indices` = 0
it seems that the grammar is incorrect?
I amended to:
# When binary, df consists of predictions for
# clf_fitted.classes_[1] but `pos_class_indices` = 0
not sure if this is clearer
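For anyone reading along, the behaviour the comment refers to can be checked directly; this snippet is only illustrative and not code from the PR:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=0)
clf_fitted = LogisticRegression().fit(X, y)

# For a binary problem, decision_function returns one score per sample;
# a positive score corresponds to clf_fitted.classes_[1]
df = clf_fitted.decision_function(X[:3])
print(clf_fitted.classes_)  # [0 1]
print(df.shape)             # (3,)
```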
sklearn/calibration.py
Outdated
if base_estimator_method == "decision_function":
    if df.ndim == 1:
        df = df[:, np.newaxis]
else:
    if len(self.label_encoder_.classes_) == 2:
        df = df[:, 1:]
The duplication is a bit unfortunate, there should be a way to merge this logic with that of _get_predictions
Maybe by extracting this into a _reshape_df utility?
Or maybe we can even let _get_predictions call cross_val_predict somehow, so that we can also merge the method checks above.
Also is df = df[:, 1:] just equivalent to df = df[:, 1]?
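For illustration, such a helper could look roughly like this (the name `_reshape_df` and its arguments are placeholders, not code from this PR):

```python
import numpy as np

def _reshape_df(df, method, n_classes):
    """Hypothetical helper: bring raw predictions into a common 2D shape."""
    if method == "decision_function":
        if df.ndim == 1:
            # binary decision_function returns shape (n_samples,)
            df = df[:, np.newaxis]
    elif n_classes == 2:
        # binary predict_proba returns shape (n_samples, 2);
        # keep only the column for the positive class
        df = df[:, 1:]
    return df
```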
Or maybe we can even let _get_predictions call cross_val_predict somehow, so that we can also merge the method checks above.
This is what I was trying to do before, but _get_predictions is used within _fit_calibrator, which is within the cv for loop. This is why I wanted to put the cv for loop within _get_predictions, but it got complicated.
I can add a _reshape_df function, but I have a little dislike for the df naming as it is not always `decision_function`. (I also first think df=dataframe, but maybe that is just me.) WDYT?
(of course if this naming is used elsewhere, happy to keep)
Also I can't think of an elegant way to add the _reshape_df function as we get the predictions of the classifier in _get_predictions but we use cross_val_predict to get the predictions in the case above...
Yes I was confused by df too. We can rename it as predictions or just preds.
Also I can't think of an elegant way to add the _reshape_df
Would it help to also add a _get_method helper, like
def _get_method(clf):
    # return decision_function (the actual method) or predict_proba, or raise an error?
Looking into it, _fit_calibrator returns a _CalibratedClassifierPipeline instance for each cv fold. For ensemble=False, this won't work (given how _fit_calibrator works atm): we want to fit once with all the predictions. We would need to take _fit_calibrator out of the cv for loop.
Looking into it, _fit_calibrator returns a _CalibratedClassifierPipeline instance for each cv fold
Only when ensemble=True, right?
There's no CV loop when ensemble=False, the "loop" is in cross_val_predict and we only need to call _fit_calibrator once. Maybe I misunderstood?
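As a minimal illustration of "the loop is in cross_val_predict" (the dataset and estimator below are arbitrary placeholders, not code from the PR):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression()

# Out-of-fold predictions for every sample; the CV loop happens inside
# cross_val_predict, so no explicit `for train, test in cv.split(X, y)` is needed
preds = cross_val_predict(clf, X, y, cv=5, method="decision_function")
print(preds.shape)  # (200,) -- 1D for a binary decision_function
```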
This is difficult over text! I may be confused too.
At the moment we have:
class CalibratedClassifierCV():
    def fit():
        if self.cv == "prefit":
            # `_fit_calibrator` using all X and y
            # append one `_CalibratedClassiferPipeline` instance to `self.calibrated_classifiers_`
        else:
            if ensemble:
                for train, test in cv.split(X, y):
                    # fit classifier
                    # call `_fit_calibrator`, producing a `_CalibratedClassiferPipeline` instance
                    # append `_CalibratedClassiferPipeline` instance to `self.calibrated_classifiers_`,
                    # ending up with `n_cv` `_CalibratedClassiferPipeline` instances
            else:
                # Use `cross_val_predict` to get all `preds`
                # call `_fit_calibrator` using `preds` and y
                # append one `_CalibratedClassiferPipeline` instance to `self.calibrated_classifiers_`

We want to get rid of the last else as there is duplication of code there. With the new _get_prediction_method we have:
class CalibratedClassifierCV():
    def fit():
        if self.cv == "prefit":
            # `_fit_calibrator` using all X and y
            # append one `_CalibratedClassiferPipeline` instance to `self.calibrated_classifiers_`
        else:
            for train, test in cv.split(X, y):
                # fit classifier
                # call `_fit_calibrator`, which first calls `_get_predictions`.
                # In case of `ensemble=False`, we use `partial(cross_val_predict())` (this works fine)
                # `_fit_calibrator` also fits calibrators and produces a `_CalibratedClassiferPipeline` instance
                # Only one `_CalibratedClassiferPipeline` is required when `ensemble=False`
                # (i.e. shouldn't be in the for loop)
                # when `ensemble=True`, `n_cv` `_CalibratedClassiferPipeline` instances are needed
                # (i.e., should be in the for loop)
Sorry if I wasn't clear! I don't think we can move away from the first version (with the last else). The duplication of code that I wanted to avoid was:

- getting the right method

    if hasattr(base_estimator, "decision_function"):
        base_estimator_method = "decision_function"
    elif hasattr(base_estimator, "predict_proba"):
        base_estimator_method = "predict_proba"

- reshaping the predictions

    if base_estimator_method == "decision_function":
        if preds.ndim == 1:
            preds = preds[:, np.newaxis]
    else:
        if len(self.label_encoder_.classes_) == 2:
            preds = preds[:, 1]

Right now, these 2 things happen both in CalibratedClassifierCV.fit (for ensemble=False) and in _get_predictions (for ensemble=True).
So to expand on my previous suggestion:

def get_prediction_method(clf, return_string=False):
    # return clf.decision_function or clf.predict_proba or their corresponding string
    # we need strings when using cross_val_predict

def _get_predictions(X, pred_func):
    # pred_func is either a method of the clf, or a partial cross_val_predict
    preds = pred_func(X)
    # reshape preds here
    return preds

class CalibratedClassifierCV():
    def fit():
        if self.cv == "prefit":
            # `_fit_calibrator` using all X and y
            # append one `_CalibratedClassiferPipeline` instance to `self.calibrated_classifiers_`
        else:
            if ensemble:
                for train, test in cv.split(X, y):
                    # fit classifier on train
                    preds = _get_predictions(test, get_prediction_method(clf, return_string=False))
                    # call `_fit_calibrator`, producing a `_CalibratedClassiferPipeline` instance
                    # append `_CalibratedClassiferPipeline` instance to `self.calibrated_classifiers_`,
                    # ending up with `n_cv` `_CalibratedClassiferPipeline` instances
            else:
                preds = _get_predictions(X, partial(cross_val_predict, ..., method=get_prediction_method(clf, return_string=True)))
                # call `_fit_calibrator` using `preds` and y
                # append one `_CalibratedClassiferPipeline` instance to `self.calibrated_classifiers_`

I think we might need to modify _fit_calibrator to only accept preds and not call _get_predictions
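To make the sketch above concrete, here is one way the helper and the partial trick could fit together; the names follow the discussion, but the exact signatures and the toy data are only illustrative, not the PR's implementation:

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def get_prediction_method(clf, return_string=False):
    """Illustrative version of the helper sketched above."""
    if hasattr(clf, "decision_function"):
        method = "decision_function"
    elif hasattr(clf, "predict_proba"):
        method = "predict_proba"
    else:
        raise RuntimeError(
            "classifier has no decision_function or predict_proba method"
        )
    return method if return_string else getattr(clf, method)

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression().fit(X, y)

# ensemble=True path: call the bound method of the already fitted classifier
preds_ensemble = get_prediction_method(clf)(X)

# ensemble=False path: bind everything except X with `partial`, so the same
# `pred_func(X)` call yields out-of-fold predictions from cross_val_predict
pred_func = partial(
    cross_val_predict, clf, y=y, cv=5,
    method=get_prediction_method(clf, return_string=True),
)
preds_cv = pred_func(X)
```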
FYI:
Also is df = df[:, 1:] just equivalent to df = df[:, 1]?
It seems that it is not the same; you get 2D output with df[:, 1:].
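A quick NumPy check of that point:

```python
import numpy as np

df = np.array([[0.2, 0.8],
               [0.6, 0.4]])

print(df[:, 1].shape)   # (2,)   -- 1D
print(df[:, 1:].shape)  # (2, 1) -- 2D, the column is kept as a column
```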
There wasn't a big change to this. Previously we passed the classes to [scikit-learn/sklearn/calibration.py, lines 180 to 181 in fd23727] and then instantiated a new [scikit-learn/sklearn/calibration.py, lines 325 to 329 in fd23727].
(Note: I just changed it so we do not pass the classes but the fitted ...)
I've made the changes. I like that [...]. I do think that it is a bit of duplication to have both _check_classifier_response_method and get_prediction_method; [...] would do the trick. But maybe that is a separate issue.
NicolasHug left a comment
Thanks @lucyleeow, made another pass.
I think the use of partial is appropriate and the refactoring looks good to me. We'll need tests for the new ensemble parameter ;)
I do think that it is a bit of duplication to have both _check_classifier_response_method and get_prediction_method [...]
Yes we can consider having a unified utility, but indeed this would be better done in a separate PR.
ping @NicolasHug @ogrisel
ogrisel left a comment
This looks very good to me. Thank you very much @lucyleeow for your patience.
Here are some fixes / small suggestions for further improvements but otherwise it seems ready for second review and merge.
Please add a new entry to doc/whats_new/0.24.rst.
Thanks for the review @ogrisel, suggestions added!
glemaitre left a comment
It looks good. Just a couple of nitpicks.
Merged! Thank you very much @lucyleeow!
        sparse.csr_matrix(X_test))]:
    cal_clf = CalibratedClassifierCV(
-       clf, method=method, cv=2, ensemble=ensemble
+       clf, method=method, cv=5, ensemble=ensemble
I used cv=5 whenever we assess the quality of the calibration, because cv=2 trains models on a much smaller dataset and can degrade the base model performance, hence making the test quite brittle.
Reference Issues/PRs
closes #16145
Follows from PR #16167
The first step was to refactor CalibratedClassifierCV, which was done in #17803. This is a branch off of #17803 (will close #17803).

What does this implement/fix? Explain your changes.

Add ensemble option to CalibratedClassifierCV. ensemble=True would be the default and would be the current implementation. ensemble=False would be the same as the libsvm implementation (used when probability=True):

When cv is not 'prefit':

Note ensemble=False is the procedure described in the original Platt paper (see #16145 (comment)).

Any other comments?
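For reference, a minimal usage sketch of the option this PR adds, as it ended up in released scikit-learn; the dataset and base estimator below are arbitrary:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, random_state=42)

# ensemble=True (default): one calibrated classifier per CV fold,
# predictions are averaged at predict time
clf_ensemble = CalibratedClassifierCV(LinearSVC(), cv=5, ensemble=True)

# ensemble=False: out-of-fold predictions from cross_val_predict are used
# to fit a single calibrator, and the base estimator is refit on all data
# (the libsvm-like procedure)
clf_single = CalibratedClassifierCV(LinearSVC(), cv=5, ensemble=False)

for clf in (clf_ensemble, clf_single):
    clf.fit(X, y)
    print(len(clf.calibrated_classifiers_))  # 5, then 1
```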