Purged K-Fold CV #3909

Patschkowski · 2025-03-12T23:08:02Z

PR for #3830.

@rcurtin, let me clean up this PR first before you have a look. I hope I now have a better understanding of how you assess PRs, so that we need fewer iterations. I'll let you know once I think the quality of the PR is good enough.

Co-authored-by: Ryan Curtin <[email protected]>

rcurtin

Thanks for implementing this @Patschkowski. The implementation actually seems pretty straightforward.

It looks like there are a couple of Jenkins build failures, but the logs are not easy to access. I am working on improving that in the background...

Do you think you could add something to the documentation in doc/user/cv.md? Maybe a section like

## The `PurgedKFoldCV` class

below the "The KFoldCV and SimpleCV classes" section. It should suffice to introduce the technique and the situation it's aimed at (perhaps using the figure that you showed in the original issue), then the constructors that are a little different from KFoldCV and SimpleCV.

Eventually I will work that documentation into the API-level documentation on its own page, like we have for RandomForest and other classes, and I can use what you write here when that time comes.

I haven't gone through the details of the implementation yet but at a first glance it looks good. I have a couple clarification questions before I dig in. Thanks again!

rcurtin · 2025-03-26T15:30:54Z

src/mlpack/tests/cv_test.cpp

+        labels,
+        numClasses));
+  }
+};


Are you sure this class is necessary? RandomForest should already work with k-fold cross-validation as it is (it should have the right form of Train() overload).

My perspective is the following: RandomForest's Train method has a double return type, while MetaInfoExtractor requires void return type.

I'm not sure it requires a void type though---DecisionTree::Train() returns double but works with KFoldCV:

https://github.com/mlpack/mlpack/blob/master/src/mlpack/tests/cv_test.cpp#L638

Debugging the templated code can be a little bit of a nightmare, but I'm happy to give it a shot if you have a minimum reproducible example of the failure of KFoldCV with RandomForest or similar.

@rcurtin please help with the debugging.

Wow, it took a bit more digging than I expected, but it turns out there was an issue with RandomForest having too many parameters for MetaInfoExtractor (actually, amusingly, that problem was introduced in #3829 but we did not notice there). I opened a fix in #3941; if you pull that into this branch, then you should be able to use RandomForest directly.

Now #3941 is merged, so, if you update against the master branch, I think everything here should be fixed (hopefully!).
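For readers following the return-type question in this thread, here is a minimal C++17 sketch of a detection trait that accepts a `Train()` member regardless of what it returns. It is not how `MetaInfoExtractor` is implemented; `HasTrain`, `VoidTrainer`, `DoubleTrainer`, and `NoTrainer` are hypothetical names used only for illustration.

```cpp
#include <type_traits>
#include <utility>

// Primary template: assume Model has no matching Train() overload.
template<typename, typename Model, typename... Args>
struct HasTrainImpl : std::false_type { };

// Chosen when `model.Train(args...)` is a valid expression.  The return type
// of Train() is discarded by std::void_t, so void and double both match.
template<typename Model, typename... Args>
struct HasTrainImpl<
    std::void_t<decltype(
        std::declval<Model&>().Train(std::declval<Args>()...))>,
    Model, Args...> : std::true_type { };

template<typename Model, typename... Args>
inline constexpr bool HasTrain = HasTrainImpl<void, Model, Args...>::value;

// Hypothetical models, for illustration only.
struct VoidTrainer { void Train(int, int) { } };
struct DoubleTrainer { double Train(int, int) { return 0.0; } };
struct NoTrainer { };

static_assert(HasTrain<VoidTrainer, int, int>, "void Train() is detected");
static_assert(HasTrain<DoubleTrainer, int, int>, "double Train() is detected");
static_assert(!HasTrain<NoTrainer, int, int>, "no Train() is rejected");
```

With a trait of this shape the return type does not matter, but the argument list does, which is the same kind of mismatch the thread eventually traced the failure to.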


rcurtin · 2025-03-26T15:39:21Z

src/mlpack/core/cv/purged_k_fold_cv.hpp

+      double embargoPercentage,
+      const MatType& xs,
+      const PredictionsType& ys,
+      const MatType& intervals);


Also, is it always necessary to pass the intervals matrix? Or does it make sense to have a default argument (or similar) such that when intervals is not specified, we assume that column i of xs corresponds only to time step i---so there is no overlap in samples.

That would match the simple case in the image given with #3830, where xs looks like stock prices (or something like this) and each sample in xs is just the price at one time step. In that case it seems like you would still want to do purged k-fold CV, but the intervals matrix would just be of the form, e.g.,

0, 0
1, 1
2, 2
3, 3
...

To me it seems like this would be the most common case to use PurgedKFoldCV, so having a 'default' for intervals could be useful. What do you think?

We now already have 6 constructors. I think defaulting that argument will introduce another 6 sets of constructors so I'd go for the current solution. The user is free to supply their intervals matrix as you suggested at any time.

Agreed, we definitely don't need to make it 12, but I just mean defaulting the argument, like: const MatType& intervals = MatType(). Or you could do const std::optional<MatType> intervals = std::nullopt (take a look at, e.g., src/mlpack/methods/bayesian_linear_regression.hpp where std::optional is used). That would prevent the need for a ton of additional constructors.
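A minimal sketch of the defaulted-argument idea, using only the standard library so that it stands alone (C++17). The `Intervals` alias and the `ResolveIntervals` helper are hypothetical and not part of this PR; the real constructors take a `MatType`.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the intervals matrix: one (start, end) pair of
// time steps per sample.
using Intervals = std::vector<std::pair<std::size_t, std::size_t>>;

// Returns the intervals to use.  When the caller supplies none, assume that
// sample i covers only time step i: (0, 0), (1, 1), (2, 2), ...
inline Intervals ResolveIntervals(
    const std::size_t numSamples,
    const std::optional<Intervals>& intervals = std::nullopt)
{
  if (intervals.has_value())
  {
    // A user-supplied set must describe every sample.
    assert(intervals->size() == numSamples);
    return *intervals;
  }

  Intervals result(numSamples);
  for (std::size_t i = 0; i < numSamples; ++i)
    result[i] = { i, i };
  return result;
}
```

Because the default lives in a single trailing parameter, no additional constructor overloads are needed.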

I still have my concerns about this design. If we use a defaulted intervals, then one could just as well use the standard KFoldCV instead of the purged version.

But to my understanding, PurgedKFoldCV with a defaulted intervals would give a different result than just KFoldCV, due to the embargoPercentage parameter. Suppose that we have a dataset where the intervals matrix would be

0, 0
1, 1
2, 2
...
99, 99

as above, and we set embargoPercentage to 0.05 (wait, should it be a percentage, like 5, or maybe should we rename the parameter just embargo and specify it should be in [0, 1]? but that is an aside). In the super simple case of k = 2, for the first CV split we should get:

training set: samples 0 through 49

embargoed and unused: samples 50 through 54

validation set: samples 55 through 99

(Also this raises a question or two for me about embargoing: for k=3 and greater, where you have a validation set in the middle of the data, you want the embargo applied both before the validation set starts and after it ends, right?)

I agree that if the user sets embargoPercentage to 0, then a defaulted intervals would be the same as regular KFoldCV and the user should use that instead---but I think the embargo makes it different. Let me know if I overlooked something.
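The arithmetic of the k = 2 example above can be sketched as follows. This reproduces only the split described in that comment, assuming the embargo is given as a fraction in [0, 1] of the total number of samples; it is not the PR's implementation, and `Range`, `EmbargoedSplit`, and `FirstSplitOfTwoFolds` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Half-open range of sample indices: [begin, end).
struct Range
{
  std::size_t begin;
  std::size_t end;
};

struct EmbargoedSplit
{
  Range training;
  Range embargoed;
  Range validation;
};

// First CV split for k = 2 as described in the comment: the first half is the
// training set, the embargoed samples follow, and the rest is validation.
inline EmbargoedSplit FirstSplitOfTwoFolds(const std::size_t numSamples,
                                           const double embargo)
{
  assert(embargo >= 0.0 && embargo <= 1.0);

  const std::size_t boundary = numSamples / 2;
  // Assumption: the embargo size is a fraction of the whole dataset, capped
  // so that it never runs past the last sample.
  const std::size_t embargoSize = std::min(
      static_cast<std::size_t>(std::round(embargo * numSamples)),
      numSamples - boundary);

  return { { 0, boundary },
           { boundary, boundary + embargoSize },
           { boundary + embargoSize, numSamples } };
}
```

For 100 samples and an embargo of 0.05 this gives training samples 0 through 49, embargoed samples 50 through 54, and validation samples 55 through 99; with an embargo of 0 the embargoed range is empty and the split matches plain k-fold.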

rcurtin

Hey @Patschkowski, I took a closer look at the implementation. The general design looks just fine to me, but I have some questions about the details of how the training and validation sets are constructed---do they use the time indexes in intervals or the column index (as KFoldCV does)? It seems to me that the time indexes are what should be used.


rcurtin · 2025-04-09T13:26:55Z

src/mlpack/core/cv/purged_k_fold_cv.hpp

+      double embargoPercentage,
+      const MatType& xs,
+      const PredictionsType& ys,
+      const MatType& intervals);


Agreed, we definitely don't need to make it 12, but I just mean defaulting the argument, like: const MatType& intervals = MatType(). Or you could do const std::optional<MatType> intervals = std::nullopt (take a look at, e.g., src/mlpack/methods/bayesian_linear_regression.hpp where std::optional is used). That would prevent the need for a ton of additional constructors.

rcurtin · 2025-04-09T13:34:44Z

src/mlpack/tests/cv_test.cpp

+        labels,
+        numClasses));
+  }
+};


I'm not sure it requires a void type though---DecisionTree::Train() returns double but works with KFoldCV:

https://github.com/mlpack/mlpack/blob/master/src/mlpack/tests/cv_test.cpp#L638

Debugging the templated code can be a little bit of a nightmare, but I'm happy to give it a shot if you have a minimum reproducible example of the failure of KFoldCV with RandomForest or similar.


rcurtin · 2025-04-09T14:13:17Z

src/mlpack/core/cv/purged_k_fold_cv_impl.hpp

+{
+  assert(i < this->k);
+
+  return (this->k - i - 1) * this->binSize;


So, this is saying that the first column of the validation set is a certain column in xs, just the same as KFoldCV. But don't we have to consider the time steps that the sample covers? Do samples need to be sorted in order of their start time as represented in intervals? I would have imagined that we would instead have taken the whole number of time steps in the dataset (e.g. intervals.max() - intervals.min()) and split these into k folds, so that this function would return a time step instead of an index of a column in xs.

However, I am not sure which of the possibilities I just wrote is the actual way that this function should be. I suspect that my confusion would be alleviated if documentation explaining the precise way data is split was added to cv.md 😄

I think the assumption is indeed that the xs is already sorted. If there was a timestep feature in xs one could use that instead. And then also your statement regarding invariant-to-shuffling in the other remark would make sense.
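To make the two readings discussed above concrete, here is a sketch that places the column-based expression from the quoted hunk next to a time-based alternative. The time-based function is one possible interpretation of the suggestion, not code from this PR: it counts time steps inclusively (max - min + 1) and reuses the same `(k - i - 1)` fold ordering, both of which are assumptions. `Intervals` is a hypothetical stand-in for the intervals matrix.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the intervals matrix: (start, end) per sample.
using Intervals = std::vector<std::pair<std::size_t, std::size_t>>;

// Column-based: mirrors the expression in the quoted hunk.  The result is an
// index of a column in xs, so xs must already be sorted by time.
inline std::size_t FirstValidationColumn(const std::size_t k,
                                         const std::size_t binSize,
                                         const std::size_t i)
{
  assert(i < k);
  return (k - i - 1) * binSize;
}

// Time-based: split the covered time span into k bins and return the first
// time step of the bin belonging to fold i.
inline std::size_t FirstValidationTimeStep(const Intervals& intervals,
                                           const std::size_t k,
                                           const std::size_t i)
{
  assert(!intervals.empty());
  assert(k > 0 && i < k);

  std::size_t tMin = intervals[0].first;
  std::size_t tMax = intervals[0].second;
  for (const auto& interval : intervals)
  {
    tMin = std::min(tMin, interval.first);
    tMax = std::max(tMax, interval.second);
  }

  const std::size_t timeBinSize = (tMax - tMin + 1) / k;
  return tMin + (k - i - 1) * timeBinSize;
}
```

The time-based version never looks at column positions, so it does not require xs to be sorted.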

rcurtin · 2025-04-09T14:15:37Z

src/mlpack/tests/cv_test.cpp

+    intervals(0, i) = static_cast<arma::mat::elem_type>(i);
+    intervals(1, i) = static_cast<arma::mat::elem_type>(
+        std::min(i + 4, ds.n_cols - 1));
+  }


I feel like it would be a good idea to add an additional test which, after building this staircase dataset, shuffles all the columns (both in xs and ys and intervals), and runs the same test. If PurgedKFoldCV is splitting into folds based on the time indexes in intervals, then the algorithm should be invariant to the ordering of points given to it.

I think in the current implementation PurgedKFoldCV is not invariant to the shuffling you are proposing

rcurtin

Hopefully these comments are helpful; let me know if I can clarify anything.

rcurtin · 2025-05-13T21:45:07Z

doc/user/cv.md

-
-In addition, the [hyperparameter tuner](hpt.md) documentation may also be
-relevant.
+# Cross-Validation


I think the diff is so weird here probably because the line endings changed from Unix-style to DOS-style. Do you think you can revert it to make the diff easier to read? (or if you want I can quickly run dos2unix if my hunch is right and push to this branch, just let me know)

rcurtin · 2025-05-13T21:54:03Z

src/mlpack/core/cv/purged_k_fold_cv.hpp

+      double embargoPercentage,
+      const MatType& xs,
+      const PredictionsType& ys,
+      const MatType& intervals);


But to my understanding, PurgedKFoldCV with a defaulted intervals would give a different result than just KFoldCV, due to the embargoPercentage parameter. Suppose that we have a dataset where the intervals matrix would be

0, 0
1, 1
2, 2
...
99, 99

as above, and we set embargoPercentage to 0.05 (wait, should it be a percentage, like 5, or maybe should we rename the parameter just embargo and specify it should be in [0, 1]? but that is an aside). In the super simple case of k = 2, for the first CV split we should get:

training set: samples 0 through 49

embargoed and unused: samples 50 through 54

validation set: samples 55 through 99

(Also this raises a question or two for me about embargoing: for k=3 and greater, where you have a validation set in the middle of the data, you want the embargo applied both before the validation set starts and after it ends, right?)

I agree that if the user sets embargoPercentage to 0, then a defaulted intervals would be the same as regular KFoldCV and the user should use that instead---but I think the embargo makes it different. Let me know if I overlooked something.

rcurtin · 2025-05-13T21:55:08Z

src/mlpack/core/util/sfinae_utility.hpp

- */
-
-#endif
+/**


I think this file had its line endings changed too. Can you change them back? (or let me know if you want me to)

rcurtin · 2025-05-13T21:55:43Z

src/mlpack/core/cv/meta_info_extractor.hpp

-
-} // namespace mlpack
-
-#endif


I think this file had its line endings changed too. Can you change them back? (or let me know if you want me to)

rcurtin · 2025-05-15T14:49:21Z

src/mlpack/tests/cv_test.cpp

+        labels,
+        numClasses));
+  }
+};


Wow, it took a bit more digging than I expected, but it turns out there was an issue with RandomForest having too many parameters for MetaInfoExtractor (actually, amusingly, that problem was introduced in #3829 but we did not notice there). I opened a fix in #3941; if you pull that into this branch, then you should be able to use RandomForest directly.

rcurtin · 2025-07-10T07:09:54Z

@Patschkowski happy to pick this one back up if you have time. 👍

Felix Patschkowski and others added 30 commits November 10, 2024 18:16

Added sequential bootstrapping to random forest.

2c23630

Merge branch 'mlpack:master' into master

dbea84f

Fixed namespace.

6390fd0

Merge branch 'master' of https://github.com/Patschkowski/mlpack

daa0b08

Fixed style checks.

74031af

Fixed compilation issue.

d0fcb45

Comply to __libcpp_random_is_valid_urng

91d17d4

Incorporated review feedback

4007ee3

Corrected style checks.

dfda309

Reverted documentation.

f630952

Made constructor implicit to catch armadillo template optimizations

735782c

Fixed compilation error.

3ca61d4

Fixed style check.

11c06ba

Fixed compilation error.

b8f9312

Align argument lists between public and private Train function.

788d506

Explicitly state template use to help GCC

757a0be

Fixed style check.

af2ca2c

Merge branch 'mlpack:master' into master

abca875

Merge branch 'mlpack:master' into master

0ca8631

Merge branch 'mlpack:master' into master

a4443bb

Added name to contributor list.

1ab1546

Merge branch 'master' of https://github.com/Patschkowski/mlpack

ebc908e

Changed indicator matrix to interval matrix

5299834

Updated docs, implementation and testing after code review.

0d78abe

Fixed style warnings.

80da1da

Updated documentation.

aa1006a

Update src/mlpack/methods/random_forest/bootstrap.hpp

0ca054d

Co-authored-by: Ryan Curtin <[email protected]>

Update doc/user/methods/random_forest.md

b5086c7

Co-authored-by: Ryan Curtin <[email protected]>

Incorporating code review feedback

9f42560

Merge branch 'master' of https://github.com/Patschkowski/mlpack

41ddd3b

Patschkowski added 2 commits March 25, 2025 05:38

Fixing Linux build

c87d9bd

Fixing unit tests

d28ddfe

rcurtin reviewed Mar 26, 2025


Patschkowski added 2 commits April 6, 2025 14:27

Incorporated feedback from code review

cecd97a

Fixed order of arguments

8d55361

rcurtin reviewed Apr 9, 2025


Patschkowski and others added 13 commits April 29, 2025 12:35

Merge branch 'mlpack:master' into purged-k-fold-cv

7373f03

Fixed formatting

2af24ba

Fixed visibility

c0e182e

Corrected typo

ac3e06e

Fixed compilation errors.

d784be6

Fixed integer comparison warning.

14e7a85

Fixed compilation issue

2265ab5

Fixed whitespace

6b261a7

Make protected members public.

0376592

Fixed visibility.

e5d6dba

Returned to facade until templates are clarified

9655ef5

Fixed GCC compilation error

93502d0

Fixed style issues

ffd9955

rcurtin mentioned this pull request May 15, 2025

Fix KFoldCV with RandomForest #3941

Merged

rcurtin reviewed May 15, 2025


github-actions bot added the s: stale label Jun 26, 2025

github-actions bot closed this Jul 10, 2025

conradsnicta reopened this Jul 15, 2025

conradsnicta added s: keep open and removed s: stale labels Jul 15, 2025

conradsnicta added s: stale and removed s: keep open labels Oct 20, 2025

github-actions bot closed this Nov 4, 2025
