-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Purged K-Fold CV #3909
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Purged K-Fold CV #3909
Conversation
rcurtin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for implementing this @Patschkowski. The implementation actually seems pretty straightforward.
It looks like there are a couple of Jenkins build failures, but the logs are not easy to access. I am working on improving that in the background...
Do you think you could add something to the documentation in doc/user/cv.md? Maybe a section like
## The `PurgedKFoldCV` class
below the "The KFoldCV and SimpleCV classes" section. It should suffice to introduce the technique and the situation it's aimed at (perhaps using the figure that you showed in the original issue), then the constructors that are a little different from KFoldCV and SimpleCV.
Eventually I will work that documentation into the API-level documentation on its own page, like we have for RandomForest and other classes, and I can use what you write here when that time comes.
I haven't gone through the details of the implementation yet but at a first glance it looks good. I have a couple clarification questions before I dig in. Thanks again!
| labels, | ||
| numClasses)); | ||
| } | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure this class is necessary? RandomForest should already work with k-fold cross-validation as it is (it should have the right form of Train() overload).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My perspective is the following: RandomForest's Train method has a double return type, while MetaInfoExtractor requires void return type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it requires a void type though---DecisionTree::Train() returns double but works with KFoldCV:
Debugging the templated code can be a little bit of a nightmare, but I'm happy to give it a shot if you have a minimum reproducible example of the failure of KFoldCV with RandomForest or similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rcurtin please help with the debugging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, it took a bit more digging than I expected, but it turns out there was an issue with RandomForest having too many parameters for MetaInfoExtractor (actually, amusingly, that problem was introduced in #3829 but we did not notice there). I opened a fix in #3941; if you pull that into this branch, then you should be able to use RandomForest directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now #3941 is merged, so, if you update against the master branch, I think everything here should be fixed (hopefully!).
| double embargoPercentage, | ||
| const MatType& xs, | ||
| const PredictionsType& ys, | ||
| const MatType& intervals); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, is it always necessary to pass the intervals matrix? Or does it make sense to have a default argument (or similar) such that when intervals is not specified, we assume that column i of xs corresponds only to time step i---so there is no overlap in samples.
That would match the simple case in the image given with #3830, where xs looks like stock prices (or something like this) and each sample in xs is just the price at one time step. In that case it seems like you would still want to do purged k-fold CV, but the intervals matrix would just be of the form, e.g.,
0, 0
1, 1
2, 2
3, 3
...
To me it seems like this would be the most common case to use PurgedKFoldCV, so having a 'default' for intervals could be useful. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We now already have 6 constructors. I think defaulting that argument will introduce another 6 sets of constructors so I'd go for the current solution. The user is free to supply their intervals matrix as you suggested at any time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, we definitely don't need to make it 12, but I just mean defaulting the argument, like: const MatType& intervals = MatType(). Or you could do const std::optional<MatType> intervals = std::nullopt (take a look at, e.g., src/mlpack/methods/bayesian_linear_regression.hpp where std::optional is used). That would prevent the need for a ton of additional constructors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still have my concerns about this design. In case we use a defaulted intervals, then one could also use that standard KFoldVC instead of the purged version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But to my understanding, PurgedKFoldCV with a defaulted intervals would give a different result than just KFoldCV, due to the embargoPercentage parameter. Suppose that we have a dataset where the intervals matrix would be
0, 0
1, 1
2, 2
...
99, 99
as above, and we set embargoPercentage to 0.05 (wait, should it be a percentage, like 5, or maybe should we rename the parameter just embargo and specify it should be in [0, 1]? but that is an aside). In the super simple case of k = 2, for the first CV split we should get:
- training set: samples 0 through 49
- embargoed and unused: samples 50 through 54
- validation set: samples 55 through 99
(Also this raises a question or two for me about embargoing: for k=3 and greater, where you have a validation set in the middle of the data, you want the embargo applied both before the validation set starts and after it ends, right?)
I agree that if the user sets embargoPercentage to 0, then a defaulted intervals would be the same as regular KFoldCV and the user should use that instead---but I think the embargo makes it different. Let me know if I overlooked something.
rcurtin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @Patschkowski, I took a closer look at the implementation. The general design looks just fine to me, but I have some questions about the details of how the training and validation sets are constructed---do they use the time indexes in intervals or the column index (as KFoldCV does)? It seems to me that the time indexes are what should be used.
| double embargoPercentage, | ||
| const MatType& xs, | ||
| const PredictionsType& ys, | ||
| const MatType& intervals); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, we definitely don't need to make it 12, but I just mean defaulting the argument, like: const MatType& intervals = MatType(). Or you could do const std::optional<MatType> intervals = std::nullopt (take a look at, e.g., src/mlpack/methods/bayesian_linear_regression.hpp where std::optional is used). That would prevent the need for a ton of additional constructors.
| labels, | ||
| numClasses)); | ||
| } | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it requires a void type though---DecisionTree::Train() returns double but works with KFoldCV:
Debugging the templated code can be a little bit of a nightmare, but I'm happy to give it a shot if you have a minimum reproducible example of the failure of KFoldCV with RandomForest or similar.
| { | ||
| assert(i < this->k); | ||
|
|
||
| return (this->k - i - 1) * this->binSize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this is saying that the first column of the validation set is a certain column in xs, just the same as KFoldCV. But don't we have to consider the time steps that the sample covers? Do samples need to be sorted in order of their start time as represented in intervals? I would have imagined that we would instead taken the whole number of time steps in the dataset (e.g. intervals.max() - intervals.min()) and split these into k folds, and so in this function we would return a time step instead of an index of a column in xs.
However, I am not sure which of the possibilities I just wrote is the actual way that this function should be. I suspect that my confusion would be alleviated if documentation explaining the precise way data is split was added to cv.md 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the assumption is indeed that the xs is already sorted. If there was a timestep feature in xs one could use that instead. And then also your statement regarding invariant-to-shuffling in the other remark would make sense.
| intervals(0, i) = static_cast<arma::mat::elem_type>(i); | ||
| intervals(1, i) = static_cast<arma::mat::elem_type>( | ||
| std::min(i + 4, ds.n_cols - 1)); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like it would be a good idea to add an additional test which, after building this staircase dataset, shuffles all the columns (both in xs and ys and intervals), and runs the same test. If PurgedKFoldCV is splitting into folds based on the time indexes in intervals, then the algorithm should be invariant to the ordering of points given to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in the current implementation PurgedKFoldCV is not invariant to the shuffling you are proposing
rcurtin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully these comments are helpful; let me know if I can clarify anything.
| In addition, the [hyperparameter tuner](hpt.md) documentation may also be | ||
| relevant. | ||
| # Cross-Validation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the diff is so weird here probably because the line endings changed from Unix-style to DOS-style. Do you think you can revert it to make the diff easier to read? (or if you want I can quickly run dos2unix if my hunch is right and push to this branch, just let me know)
| double embargoPercentage, | ||
| const MatType& xs, | ||
| const PredictionsType& ys, | ||
| const MatType& intervals); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But to my understanding, PurgedKFoldCV with a defaulted intervals would give a different result than just KFoldCV, due to the embargoPercentage parameter. Suppose that we have a dataset where the intervals matrix would be
0, 0
1, 1
2, 2
...
99, 99
as above, and we set embargoPercentage to 0.05 (wait, should it be a percentage, like 5, or maybe should we rename the parameter just embargo and specify it should be in [0, 1]? but that is an aside). In the super simple case of k = 2, for the first CV split we should get:
- training set: samples 0 through 49
- embargoed and unused: samples 50 through 54
- validation set: samples 55 through 99
(Also this raises a question or two for me about embargoing: for k=3 and greater, where you have a validation set in the middle of the data, you want the embargo applied both before the validation set starts and after it ends, right?)
I agree that if the user sets embargoPercentage to 0, then a defaulted intervals would be the same as regular KFoldCV and the user should use that instead---but I think the embargo makes it different. Let me know if I overlooked something.
| */ | ||
|
|
||
| #endif | ||
| /** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this file had its line endings changed too. Can you change them back? (or let me know if you want me to)
|
|
||
| } // namespace mlpack | ||
|
|
||
| #endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this file had its line endings changed too. Can you change them back? (or let me know if you want me to)
| labels, | ||
| numClasses)); | ||
| } | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, it took a bit more digging than I expected, but it turns out there was an issue with RandomForest having too many parameters for MetaInfoExtractor (actually, amusingly, that problem was introduced in #3829 but we did not notice there). I opened a fix in #3941; if you pull that into this branch, then you should be able to use RandomForest directly.
|
@Patschkowski happy to pick this one back up if you have time. 👍 |
PR for #3830.
@rcurtin , let me clean this PR first before you have a look. I hope I have a better understanding now on how you assess the PRs so that we can have less iterations. I'll let you know once I think the quality of the PR is good enough