Open
Labels
P3 (Doc bugs, questions, minor issues, etc.); question (Further information is requested)
Description
Hi All,
As we approach v1.0, I thought it might be nice to look at the API for cross-validation. Currently, our cross-validation API takes the inputs:
IDataView data; // Training data
IEstimator<ITransformer> estimator; // Model to fit
int numFolds; // Number of folds to make
string labelColumn; // The label
string stratificationColumn; // The column to stratify on
seed; // The seed
and returns an array of
RegressionMetrics metrics;
ITransformer model;
IDataView scoredTestData;
with one entry for each fold.
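To make the shape of this contract concrete, here is a minimal, language-agnostic sketch in Python. It is not the ML.NET implementation: `FoldResult`, `cross_validate`, and `fit_mean` are hypothetical names, plain lists of dicts stand in for `IDataView`, a callable stands in for `ITransformer`, and the stratification column is left out for brevity.

```python
import random
from dataclasses import dataclass
from statistics import mean


@dataclass
class FoldResult:
    metrics: dict            # stands in for RegressionMetrics
    model: object            # stands in for ITransformer
    scored_test_data: list   # stands in for IDataView


def cross_validate(data, fit, num_folds, label_column, seed=0):
    """Return one FoldResult per fold. `fit(rows)` must return a callable model."""
    rng = random.Random(seed)
    order = list(range(len(data)))
    rng.shuffle(order)
    results = []
    for fold in range(num_folds):
        # Every num_folds-th shuffled row is held out as this fold's test set.
        test_idx = set(order[fold::num_folds])
        train = [data[i] for i in order if i not in test_idx]
        test = [data[i] for i in order if i in test_idx]
        model = fit(train)
        scored = [{**row, "Score": model(row)} for row in test]
        # The metric is computed against label_column, independently of the learner.
        mse = mean((r["Score"] - r[label_column]) ** 2 for r in scored)
        results.append(FoldResult({"mse": mse}, model, scored))
    return results


def fit_mean(rows):
    """Toy estimator (hypothetical): always predicts the mean training label."""
    avg = mean(r["Label"] for r in rows)
    return lambda row: avg


data = [{"x": i, "Label": 2.0 * i} for i in range(10)]
folds = cross_validate(data, fit_mean, num_folds=5, label_column="Label")
```

Note that `fit_mean` reads its own label column while the metric uses `label_column`, so in this sketch nothing forces the two to agree.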
I have a few questions:
- Are we happy with the outputs?
  I'm not overly concerned with these, but it will be hard to make this list smaller as we go.
- Do we need to specify labelColumn?
  Isn't there a way to get the label from the model? Making this explicit means that we are allowing the learner and the CV metrics to utilize different labels.
- Are we using the right terminology for stratification?
  Stratification usually means that ratios of classes are maintained across splits (see stratified sampling on Wikipedia). Here, stratification means that items with the same value are clumped into the same split. The former makes sense if you want to maintain class ratios, especially with highly imbalanced classes, while the latter is useful for things like ranking (e.g. groupIds) or where leakage due to something like ordering may be a concern.
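The two meanings of stratification can be contrasted with a small sketch. This is an illustration in Python under stated assumptions, not how ML.NET assigns folds: `ratio_preserving_folds` and `group_preserving_folds` are hypothetical helpers, and the labels and group ids are made-up toy data.

```python
import random
from collections import defaultdict


def ratio_preserving_folds(labels, num_folds, seed=0):
    """Classic stratified sampling: each fold keeps roughly the overall class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in enumerate(labels):
        by_class[label].append(row)
    fold_of = [None] * len(labels)
    for rows in by_class.values():
        rng.shuffle(rows)
        # Deal the rows of each class round-robin across the folds.
        for i, row in enumerate(rows):
            fold_of[row] = i % num_folds
    return fold_of


def group_preserving_folds(keys, num_folds, seed=0):
    """Grouped splitting: every row sharing a key value lands in the same fold."""
    rng = random.Random(seed)
    distinct = sorted(set(keys))
    rng.shuffle(distinct)
    # Assign whole key values, not individual rows, to folds.
    fold_of_key = {key: i % num_folds for i, key in enumerate(distinct)}
    return [fold_of_key[key] for key in keys]


labels = ["pos"] * 4 + ["neg"] * 16           # imbalanced toy classes, 1:4
group_ids = [row // 4 for row in range(20)]   # toy group ids, 5 groups of 4 rows

ratio_folds = ratio_preserving_folds(labels, num_folds=2)
group_folds = group_preserving_folds(group_ids, num_folds=2)
```

With the first helper, each of the two folds receives 2 "pos" and 8 "neg" rows, so the 1:4 ratio is preserved. With the second, all four rows of a group share a fold, which is what prevents leakage between train and test, but class ratios are not controlled.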