Cross-Validation API for v1.0 #2487

@rogancarr

Hi All,

As we approach v1.0, I thought it might be nice to look at the API for cross-validation. Currently, our cross-validation API takes the inputs:

IDataView data; // training data
IEstimator<ITransformer> estimator; // Model to fit
int numFolds; // Number of folds to make
string labelColumn; // The label
string stratificationColumn; // The column to stratify on
seed; // The seed

and returns an array of

RegressionMetrics metrics;
ITransformer model;
IDataView scoredTestData;

with one entry for each fold.

I have a few questions:

  1. Are we happy with the outputs?
    I'm not overly concerned with these, but it will be hard to shrink this list once v1.0 ships.
  2. Do we need to specify labelColumn?
    Isn't there a way to get the label from the model? Making this explicit means that we allow the learner and the CV metrics to use different labels.
  3. Are we using the right terminology for stratification?
    Stratification usually means that class ratios are maintained across splits (see stratified sampling on Wikipedia). Here, stratification means that items with the same value are kept together in the same split. The former makes sense if you want to maintain class ratios, especially with highly imbalanced classes, while the latter is useful for things like ranking (e.g. groupIds) or where leakage due to something like ordering may be a concern.
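
To make the distinction in question 3 concrete, here is a minimal sketch in plain C# of the two behaviors. It is not ML.NET code: `FoldAssignment`, `StratifiedFolds`, and `GroupedFolds` are hypothetical names made up for illustration, and the fold assignment is deterministic (no shuffling or seed) to keep the example short.

```csharp
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of ML.NET.
public static class FoldAssignment
{
    // "Stratified" in the usual statistical sense: rows of each class are dealt
    // round-robin across folds, so every fold gets roughly the same class mix.
    public static int[] StratifiedFolds(IReadOnlyList<string> labels, int numFolds)
    {
        var folds = new int[labels.Count];
        var seenPerClass = new Dictionary<string, int>();
        for (int i = 0; i < labels.Count; i++)
        {
            // seen is 0 the first time a class appears.
            seenPerClass.TryGetValue(labels[i], out int seen);
            folds[i] = seen % numFolds;
            seenPerClass[labels[i]] = seen + 1;
        }
        return folds;
    }

    // "Grouped" assignment (what the current parameter does, as described above):
    // every row that shares a key value is placed in the same fold.
    public static int[] GroupedFolds(IReadOnlyList<string> keys, int numFolds)
    {
        var folds = new int[keys.Count];
        var foldOfKey = new Dictionary<string, int>();
        for (int i = 0; i < keys.Count; i++)
        {
            if (!foldOfKey.TryGetValue(keys[i], out int fold))
            {
                // First time this key is seen: give it the next fold in rotation.
                fold = foldOfKey.Count % numFolds;
                foldOfKey[keys[i]] = fold;
            }
            folds[i] = fold;
        }
        return folds;
    }
}
```

With labels `pos, neg, neg, neg, pos, neg, neg, neg` and two folds, `StratifiedFolds` puts one `pos` and three `neg` rows in each fold. With keys `q1, q1, q2, q2, q3, q3` and two folds, `GroupedFolds` keeps each query's rows together, so no query is split between train and test. A real implementation would presumably randomize the assignment using the seed; that is left out here.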
