Pipeline should provide a method to apply its transformations to an arbitrary dataset without applying the final classifier step.
Use case:
Boosted tree models like XGBoost and LightGBM use a validation set for early stopping.
We can trivially apply the pipeline to the train and test sets via fit and predict, but not to the validation set.
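A minimal sketch of the gap, using a plain scikit-learn pipeline (LogisticRegression stands in for a booster here): fit and predict work through the pipeline, but preprocessing the validation set, e.g. to build an eval_set for an early-stopping booster, currently requires looping over the steps by hand.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)   # fit: works
pipe.predict(X_valid)        # predict: works
# pipe.transform(X_valid) fails: LogisticRegression has no transform.

# Workaround: push the validation set through every step but the last,
# so it matches the representation the final estimator was trained on.
X_valid_t = X_valid
for _, step in pipe.steps[:-1]:
    X_valid_t = step.transform(X_valid_t)
```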
After raising the issue and proposing two ideas at LightGBM (microsoft/LightGBM#299) and XGBoost (dmlc/xgboost#2039), I believe it should be handled at the scikit-learn level.
Idea 1, have a dummy transform method in XGBClassifier and LGBMClassifier
The transform method for pipelines/classifiers is already extremely inconsistent:
- Failure because the classifier step does not implement transform
- Deprecated feature importance extraction for trees ensemble
- NN features proposition for MLPClassifier (transform method in MLPClassifier #8291)
- Decision path proposition for tree ensembles (transform method of tree ensembles should return the decision_path #7907)
Furthermore, the issue will pop up again if the last step is an ensemble of multiple models.
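Idea 1 can be sketched with a toy classifier whose transform is the identity; DummyTransformMixin and ToyClassifier are hypothetical names, not part of scikit-learn, XGBoost, or LightGBM.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class DummyTransformMixin:
    """Hypothetical mixin giving a classifier a no-op transform."""

    def transform(self, X):
        # Identity: pass features through untouched, so that
        # Pipeline.transform can run end to end.
        return X

class ToyClassifier(DummyTransformMixin, BaseEstimator, ClassifierMixin):
    """Stand-in for XGBClassifier/LGBMClassifier in this sketch."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        # Trivially predicts the first class; a real booster would
        # of course learn from the data.
        return np.full(len(X), self.classes_[0])
```

With such a mixin, Pipeline.transform would run through every step, but as noted above, this overloads transform with yet another meaning alongside the feature-importance and decision-path propositions.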
Idea 2, Implement a validation_split parameter for early stopping
Early stopping in KerasClassifier is controlled by a validation_split parameter.
At first I thought that could be used in XGBClassifier and LGBMClassifier and everything else that would need a validation set for early stopping.
The issue here is that the user has no control over the validation set or how it is split. Furthermore, if validation problems need deeper inspection, I suppose it would be non-trivial to extract the validation data from the classifier or to provide an API for it.
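A sketch of Idea 2 and its drawback: a hypothetical estimator that carves a validation set out of the training data inside fit, KerasClassifier-style. The class name is an assumption; only the parameter name validation_split follows Keras.

```python
import numpy as np
from sklearn.model_selection import train_test_split

class EarlyStoppingEstimator:
    """Hypothetical estimator that splits off its own validation set."""

    def __init__(self, validation_split=0.2):
        self.validation_split = validation_split

    def fit(self, X, y):
        # The split happens inside fit, *after* any pipeline
        # transformers have already run, so the user never sees or
        # controls the validation data directly.
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, test_size=self.validation_split, random_state=0)
        # A real implementation would fit with early stopping against
        # (X_val, y_val); here we only keep the data to show that
        # exposing it would require a dedicated attribute or API.
        self.validation_data_ = (X_val, y_val)
        return self

X = np.random.RandomState(0).randn(100, 5)
y = (X[:, 0] > 0).astype(int)
est = EarlyStoppingEstimator(validation_split=0.2).fit(X, y)
```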
Hence I think scikit-learn needs a method, or a parameter on transform, to ignore the last step or the last n steps.
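One possible shape for that proposal, written as a free function; the function name and the skip_last parameter are assumptions for illustration, not an existing scikit-learn API.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def transform_skipping_last(pipeline, X, skip_last=1):
    """Apply a fitted pipeline's steps to X, ignoring the last skip_last."""
    steps = pipeline.steps[:-skip_last] if skip_last > 0 else pipeline.steps
    Xt = X
    for _, step in steps:
        Xt = step.transform(Xt)
    return Xt

X, y = make_classification(n_samples=100, n_features=10, random_state=0)
pipe = Pipeline([("scale", StandardScaler()),
                 ("pca", PCA(n_components=3)),
                 ("clf", LogisticRegression())])
pipe.fit(X, y)

X_pre = transform_skipping_last(pipe, X)        # scaler + PCA
X_scaled = transform_skipping_last(pipe, X, 2)  # scaler only
```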
If needed, I can raise a related issue about having a consistent transform method for classifiers, and keep this one focused on applying transform without classification on arbitrary data.