This repository was archived by the owner on Nov 27, 2023. It is now read-only.

Conversation

@ryan-deak-zefr commented Feb 26, 2019

DO NOT MERGE THIS PR WITHOUT ASKING!

Purpose

To correct metric calculations in model validation (and cross validation) by using sample_weight in the metric calculations whenever sample_weight is supplied in fit_params.

Adds sample weighting for model validation methods like cross validation inside sklearn/model_selection/_validation.py and sklearn/model_selection/_search.py.

Reference Issues/PRs

Fixes scikit-learn#4632. See also scikit-learn#10806.

What does this implement/fix? Explain your changes.

This change always uses the same sample_weight used in training to weight the metric calculations in _fit_and_score and the functions called by _fit_and_score, so it should be thought of as changing the default behavior of things like cross validation. This is similar to PR scikit-learn#10806, but it doesn't allow separate test weights to be specified. It changes _validation.py and _search.py.

None of the CV search classes (GridSearchCV, etc) are changed. This operates at the function level.
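
Roughly, the intended behavior can be sketched as follows. This is illustrative only (fit_and_score_sketch, scorer, fold_train, and fold_test are hypothetical names, and the scorer is assumed to accept a sample_weight keyword); it is not the actual _fit_and_score diff.

def fit_and_score_sketch(estimator, X, y, scorer, fold_train, fold_test, fit_params):
    # Pull the full-length sample_weight out of fit_params, if present.
    sample_weight = fit_params.get('sample_weight')

    # Fit on the training split, slicing the weights to the training rows.
    train_params = dict(fit_params)
    if sample_weight is not None:
        train_params['sample_weight'] = sample_weight[fold_train]
    estimator.fit(X[fold_train], y[fold_train], **train_params)

    # Score with the matching slice of the same weights -- the behavior this PR adds.
    if sample_weight is not None:
        return scorer(estimator, X[fold_test], y[fold_test],
                      sample_weight=sample_weight[fold_test])
    return scorer(estimator, X[fold_test], y[fold_test])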

Any other comments?

This PR provides a simple test showing how unweighted metrics have very unfavorable consequences when models are trained with importance weights but not scored with those same weights. The test shows that the current omission of sample_weight in metric calculations considerably overestimates the accuracy.
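
To illustrate the effect (a toy example, not the PR's actual test), unweighted accuracy can look much better than the weighted accuracy the model was actually trained toward:

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 1])          # degenerate model that always predicts 1
w = np.array([1.0, 1.0, 10.0, 10.0])     # importance weights emphasize the class-0 rows

print(accuracy_score(y_true, y_pred))                   # 0.5   (unweighted)
print(accuracy_score(y_true, y_pred, sample_weight=w))  # ~0.09 (weighted)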

@ghost added the wip label Feb 26, 2019
@ghost assigned ryan-deak-zefr Feb 26, 2019
@ryan-deak-zefr (Author) left a comment

I've flagged a few places below that I'm concerned about.

# test_sample_weight_sums.
if self.return_train_score:
    if train_sample_weight_sums is None:
        samples = int(np.sum(test_sample_counts[:n_splits]))
@ryan-deak-zefr (Author):

These two lines are the hairiest of all. I'm not 100% certain this is the right way to do it, but unfortunately I don't see a better way, because the total dataset size isn't maintained anywhere inside the function. Any thoughts?

Reviewer:

Can we not store the total data set size in some way and pass it in?

@ryan-deak-zefr (Author):

I'm not sure that's possible. We can't necessarily just take the dataset size, because the cross validation iterator (the cv parameter in GridSearchCV) isn't required to iterate over the entire dataset. So the totals should be based on the sizes defined by cv.
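
For example (illustrative, not part of the PR), ShuffleSplit can be configured so each split covers only part of the data, in which case len(X) is not the right total:

import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(1000).reshape(-1, 1)
cv = ShuffleSplit(n_splits=3, train_size=0.5, test_size=0.2, random_state=0)

for train_idx, test_idx in cv.split(X):
    # Each split touches only 700 of the 1000 rows.
    print(len(train_idx) + len(test_idx))   # 700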

@ryan-deak-zefr (Author):

Should I put a comment to this effect in the code?

else:
    out, train_wts, test_wts = ([], [], [])

train_wts = collapse_nones(train_wts)
@ryan-deak-zefr (Author):

Turns a list of Nones into a single None.
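
For example, using the helpers from the diff (illustrative calls; the definitions are quoted further down in this thread):

def is_none(x):
    return x is None

def collapse_nones(xs):
    return None if xs is None or any(map(is_none, xs)) else xs

collapse_nones([None, None, None])  # -> None
collapse_nones([0.5, 1.0, 2.0])     # -> [0.5, 1.0, 2.0]
collapse_nones([0.5, None, 2.0])    # -> None (any None collapses the whole list)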

if train_sample_weight_sums is None:
    samples = int(np.sum(test_sample_counts[:n_splits]))
    train_sample_counts = \
        np.array(test_sample_counts[:n_splits], dtype=np.int) - \

@zachary-mcpherson-zefr:

Could you explain why this difference is necessary?

@ryan-deak-zefr (Author):

@zachary-mcpherson-zefr: Good catch! The arguments are backward. Will fix. In the meantime, here's the reasoning (assuming the code was correct):

Say that you have the following three folds for cross validation:

  • fold 1: 50 examples
  • fold 2: 51 examples
  • fold 3: 49 examples

Cross validation typically works by iterating over the test folds. Say we are on the second loop iteration, so fold 2 is the test fold. That means folds 1 and 3 are the training folds, and there should be 99 training examples when fold 2 is the test fold.

samples should represent the total number of rows in the entire dataset. Then, to get the number of training rows associated with a test fold, we subtract the number of test rows from that total. So, back to our example: if the test fold is fold 2, we have samples = 150, fold 2 has 51 rows, and 150 - 51 == 99. Make sense?
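
Concretely, with the fold sizes above (variable names mirror the snippet under discussion; this is just the arithmetic, not the PR code):

import numpy as np

test_sample_counts = np.array([50, 51, 49])   # fold i is the test fold on iteration i
samples = int(np.sum(test_sample_counts))     # 150 rows covered by the CV splits
train_sample_counts = samples - test_sample_counts
print(train_sample_counts)                    # [100  99 101] -> 99 when fold 2 is the test fold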

@zachary-mcpherson-zefr:

Yeah that makes sense, thanks!

def is_none(x):
    return x is None

def collapse_nones(xs):
    return None if xs is None or any(map(is_none, xs)) else xs
Reviewer:

Perhaps the function could be written with a generator expression instead of map?

def collapse_nones_g(xs):
    if xs is None or any(x is None for x in xs):
        return None
    return xs

@ryan-deak-zefr (Author):

It's already lazy. In Python 3, map returns an iterator, so any(map(...)) short-circuits:

In [32]: %time sum(range(200000000))
CPU times: user 3.86 s, sys: 9.2 ms, total: 3.86 s
Wall time: 3.86 s
In [34]: %timeit any(map(lambda x: x > 2, range(200000000)))
940 ns ± 6.97 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
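
A more direct way to see the short-circuiting (an illustrative check, not from the thread; is_none_verbose is a hypothetical helper):

def is_none_verbose(x):
    print("checking", x)
    return x is None

# any() consumes the lazy map only until the first None is found.
any(map(is_none_verbose, [1.0, None, 2.0, 3.0]))
# prints "checking 1.0" and "checking None"; the remaining elements are never evaluated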

(precision_score, True, False, [2000000, 1000000, 1, 999999]),
(log_loss, False, True, [2500000, 500000, 200000, 100000]),
(brier_score_loss, False, True, [2500000, 500000, 200000, 100000]),
])

Reviewer:

Should we test more scorers?

@ryan-deak-zefr (Author):

I don't think it's necessary. We could, but this is already much better than what currently exists in sklearn. I also have https://gist.github.com/deaktator/94545f807f139eba2c8a15381f2495e0, which I'd like to submit as a future PR to sklearn. It's just test code, not a change to the implementation.

@ryan-deak-zefr merged commit 5913e40 into master Feb 27, 2019
@ryan-deak-zefr deleted the DAS-1145_sample_wt_validation branch February 27, 2019 19:10


Development

Successfully merging this pull request may close these issues.

Should cross-validation scoring take sample-weights into account?
