Use sample_weight in the metric calculations in model validation and cross validation #2
Conversation
ryan-deak-zefr left a comment
Flagged a few places that I'm concerned about.
# test_sample_weight_sums.
if self.return_train_score:
    if train_sample_weight_sums is None:
        samples = int(np.sum(test_sample_counts[:n_splits]))
These two lines are the hairiest of all. I'm not 100% certain this is the right way to do it, but unfortunately I don't see a better way, because the total dataset size isn't maintained anywhere inside the function. Any thoughts?
Can we not store the total data set size in some way and pass it in?
I'm not sure that's possible. We can't necessarily just take the dataset size, because the cross validation iterator (the cv parameter in GridSearchCV) isn't required to iterate over the entire dataset. So the total should be based on the sizes defined by cv.
Should I put a comment to this effect in the code?
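A small illustration of that point (not code from the PR; ShuffleSplit is just one splitter that behaves this way): a cv iterator can leave part of the data out of every split, so totals have to come from the sizes defined by cv rather than from len(X).

import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(100).reshape(-1, 1)
cv = ShuffleSplit(n_splits=3, train_size=0.5, test_size=0.2, random_state=0)
for train_idx, test_idx in cv.split(X):
    # Each split touches only 70 of the 100 rows, so len(X) is not a reliable total.
    print(len(train_idx), len(test_idx))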
else:
    out, train_wts, test_wts = ([], [], [])

train_wts = collapse_nones(train_wts)
Turns a list of None into a None.
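A quick illustration of that behavior, reusing the helper quoted further down (the per-split weight lists here are made up):

def is_none(x):
    return x is None

def collapse_nones(xs):
    # A list that is None, or that contains any None, collapses to a single None;
    # otherwise the list passes through unchanged.
    return None if xs is None or any(map(is_none, xs)) else xs

print(collapse_nones([None, None]))            # None
print(collapse_nones([[0.5, 2.0], None]))      # None
print(collapse_nones([[0.5, 2.0], [1.0]]))     # [[0.5, 2.0], [1.0]]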
sklearn/model_selection/_search.py
Outdated
if train_sample_weight_sums is None:
    samples = int(np.sum(test_sample_counts[:n_splits]))
    train_sample_counts = \
        np.array(test_sample_counts[:n_splits], dtype=np.int) - \
Could you explain why this difference is necessary?
@zachary-mcpherson-zefr: Good catch! The arguments are backward. Will fix. In the meantime, here's the reasoning (assuming the code was correct):
Say that you have the following three folds for cross validation:
- fold 1: 50 examples
- fold 2: 51 examples
- fold 3: 49 examples
Cross validation typically works by iterating over the test folds. Say we're on the second loop iteration, so fold 2 is the test fold; folds 1 and 3 are then the training folds, and there should be 99 training examples for test fold 2.
samples should represent the total number of rows in the entire dataset. To get the number of training rows associated with a test fold, we subtract the number of test rows from that total. Back to our example: with fold 2 as the test fold, samples = 150 and fold 2 has 51 rows, so 150 - 51 == 99. Make sense?
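A sketch of that arithmetic with the argument order corrected (variable names mirror the diff, but this is not the PR code):

import numpy as np

test_sample_counts = [50, 51, 49]   # fold sizes from the example above
n_splits = 3

samples = int(np.sum(test_sample_counts[:n_splits]))                     # 150 rows in total
train_sample_counts = samples - np.array(test_sample_counts[:n_splits])  # rows outside each test fold

print(samples)               # 150
print(train_sample_counts)   # [100  99 101] -> 99 training rows when fold 2 is the test fold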
Yeah that makes sense, thanks!
    return x is None


def collapse_nones(xs):
    return None if xs is None or any(map(is_none, xs)) else xs
Perhaps the function can be written with a generator expression instead of a map?
def collapse_nones_g(xs):
    if xs is None or any(x is None for x in xs):
        return None
    return xs
It's already a generator:
In [32]: %time sum(range(200000000))
CPU times: user 3.86 s, sys: 9.2 ms, total: 3.86 s
Wall time: 3.86 s
In [34]: %timeit any(map(lambda x: x > 2, range(200000000)))
940 ns ± 6.97 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
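A small check of that claim (assuming Python 3, where map returns a lazy iterator): any(map(...)) short-circuits exactly like the generator-expression form.

def is_none(x):
    return x is None

seen = []
def noted(x):        # records which elements actually get inspected
    seen.append(x)
    return x

xs = [None, 1, 2, 3]
print(any(map(is_none, (noted(x) for x in xs))))   # True
print(seen)                                        # [None] -- stopped at the first element
print(any(x is None for x in xs))                  # True, same behavior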
    (precision_score, True, False, [2000000, 1000000, 1, 999999]),
    (log_loss, False, True, [2500000, 500000, 200000, 100000]),
    (brier_score_loss, False, True, [2500000, 500000, 200000, 100000]),
])
Should we test more scorers?
I don't think it's necessary. We could, but this is way better than what seems to currently exist in sklearn. I also have https://gist.github.com/deaktator/94545f807f139eba2c8a15381f2495e0 for which I'd like to submit a future PR to sklearn. It's just test code. Not any change to implementation.
DO NOT MERGE THIS PR WITHOUT ASKING!
Purpose
To correct metric calculations in model validation (and cross validation) by attempting to use sample_weight in the metric calculations when sample_weight is supplied in fit_params. Adds sample weighting for model validation methods like cross validation inside sklearn/model_selection/_validation.py and sklearn/model_selection/_search.py.
Reference Issues/PRs
Fixes scikit-learn#4632. See also scikit-learn#10806.
What does this implement/fix? Explain your changes.
This change always uses the same sample_weight used in training to weight the metric calculations in _fit_and_score and the functions called by _fit_and_score, so it should be thought of as changing the default behavior of things like cross validation. This is similar to PR scikit-learn#10806 but doesn't allow separate test weights to be specified. It changes _validation.py and _search.py. None of the CV search classes (GridSearchCV, etc.) are changed; this operates at the function level.
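A hedged sketch of how the changed behavior would be exercised (the call below is the existing public cross_validate surface; the data and weights are made up, and only the internal scoring behavior differs with this PR):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=1000, random_state=0)
w = np.random.RandomState(0).uniform(0.1, 10.0, size=len(y))

# sample_weight goes in via fit_params; with this change the same weights are
# also applied to the train/test metric calculations.
res = cross_validate(LogisticRegression(), X, y, cv=3,
                     scoring='accuracy',
                     fit_params={'sample_weight': w},
                     return_train_score=True)
print(res['test_score'])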
Any other comments?
This PR provides some simple tests showing how unweighted metrics have very unfavorable consequences when models are trained with importance weights but not scored with the same weights. The tests show that the current omission of sample_weight in the metric calculations considerably overestimates the accuracy.