Cross-Validation API for v1.0

Hi All,

As we approach v1.0, I thought it might be nice to look at the API for cross-validation. Currently, our cross-validation API takes the inputs:

```cs
IDataView data; // Training data
IEstimator<ITransformer> estimator; // Model to fit
int numFolds; // Number of folds to make
string labelColumn; // The label column
string stratificationColumn; // The column to stratify on
uint? seed; // The random seed (nullable unsigned type assumed)
```

and returns an array of
```cs
RegressionMetrics metrics;
ITransformer model;
IDataView scoredTestData;
```
with one entry for each fold.

I have a few questions:

1) Are we happy with the outputs?
I'm not overly concerned with these, but it will be hard to remove items from this list later.
2) Do we need to specify `labelColumn`?
Isn't there a way to get the label from the model? Making this explicit means that we allow the learner and the CV metrics to use different labels.
3) Are we using the right terminology for `stratification`?
Stratification usually means that the ratios of classes are maintained across splits (see [stratified sampling](https://en.wikipedia.org/wiki/Stratified_sampling) on Wikipedia). Here, `stratification` means that items with the same value are clumped into the same split. The former makes sense if you want to maintain class ratios, especially with highly imbalanced classes, while the latter is useful for things like ranking (e.g., `groupIds`) or where leakage due to something like ordering may be a concern.
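
To make the distinction in question 3 concrete, here is a minimal sketch of the two behaviors. It does not use our API at all: `FoldAssignment`, `GroupedFolds`, and `StratifiedFolds` are hypothetical names invented for illustration, and the assignment is a deterministic round-robin rather than the seeded shuffle a real implementation would presumably use.

```cs
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only; not part of any existing API.
public static class FoldAssignment
{
    // "Grouped" split (what our stratificationColumn does today):
    // every row sharing a key lands in the same fold.
    // Distinct keys are dealt round-robin in order of first appearance.
    public static int[] GroupedFolds<TKey>(IReadOnlyList<TKey> keys, int numFolds)
    {
        var foldOfKey = new Dictionary<TKey, int>();
        var folds = new int[keys.Count];
        for (int i = 0; i < keys.Count; i++)
        {
            if (!foldOfKey.TryGetValue(keys[i], out int fold))
            {
                fold = foldOfKey.Count % numFolds;
                foldOfKey[keys[i]] = fold;
            }
            folds[i] = fold;
        }
        return folds;
    }

    // "Stratified" split (the usual statistical meaning):
    // rows of each class are dealt round-robin across folds, so every
    // fold keeps roughly the class ratio of the full data set.
    public static int[] StratifiedFolds<TLabel>(IReadOnlyList<TLabel> labels, int numFolds)
    {
        var seenPerClass = new Dictionary<TLabel, int>();
        var folds = new int[labels.Count];
        for (int i = 0; i < labels.Count; i++)
        {
            seenPerClass.TryGetValue(labels[i], out int seen); // 0 if class is new
            folds[i] = seen % numFolds;
            seenPerClass[labels[i]] = seen + 1;
        }
        return folds;
    }
}
```

With two folds, `GroupedFolds` on the keys `q1, q1, q2, q2, q3, q3` yields folds `0, 0, 1, 1, 0, 0` (each query stays together), while `StratifiedFolds` on the labels `a, a, a, a, b, b` yields `0, 1, 0, 1, 0, 1` (each fold gets two `a` rows and one `b` row). The two are different operations on the same kind of column, which is why sharing one name for both seems confusing.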

Cross-Validation API for v1.0 #2487