Description
Some data transformations -- including over/under-sampling (#1454), outlier removal, instance reduction, and other forms of dataset compression, like that used in BIRCH (#3802) -- entail altering a dataset at training time while leaving it unaltered at prediction time. (In some cases, such as outlier removal, it makes sense to reapply a fitted model to new data, while in others model reuse after fitting seems less applicable.)
As noted elsewhere, transformers that change the number of samples are not currently supported, particularly in the context of Pipelines, where a transformation is applied at both fit and predict time (although a hack might abuse fit_transform to make this not so). Pipelines of Transformers also cannot cope with changes in the number of samples at fit time for supervised problems, because Transformers return only a modified X, not a modified y.
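To illustrate the mismatch, here is a minimal sketch (the class name and threshold are hypothetical, not an existing scikit-learn API) of a transformer-style outlier remover. Because transform can only return X, the corresponding y is left at its original length, which would break any downstream supervised estimator in a Pipeline:

```python
import numpy as np

# Hypothetical transformer-style outlier remover: transform() can only
# return a modified X, so y keeps its original length.
class NaiveOutlierRemover:
    def fit(self, X, y=None):
        # Keep rows whose first feature is within a fixed threshold.
        self.mask_ = np.abs(X[:, 0]) < 3.0
        return self

    def transform(self, X):
        return X[self.mask_]

X = np.array([[0.1], [0.2], [5.0], [0.3]])
y = np.array([0, 1, 1, 0])

Xt = NaiveOutlierRemover().fit(X, y).transform(X)
# X shrinks to 3 rows but y still has 4 entries -- a supervised
# estimator fitted next in a Pipeline would see mismatched lengths.
print(Xt.shape[0], len(y))  # 3 4
```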
To handle this class of problems, I propose introducing a new category of estimator, called a Resampler. It must define at least a fit_resample method, which Pipeline will call at fit time; at other times the data passes through unchanged. (For this reason, a Resampler cannot also be a Transformer, or else we would need to define their precedence.)
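A minimal sketch of what such an estimator might look like (the class name and return convention here are illustrative assumptions, not a settled API); the key difference from a Transformer is that both X and y come back filtered together:

```python
import numpy as np

# Sketch of the proposed Resampler category: fit_resample is called by
# Pipeline at fit time only; at predict time data passes through unchanged.
class OutlierResampler:
    def fit_resample(self, X, y):
        mask = np.abs(X[:, 0]) < 3.0
        # Unlike transform(), both X and y are filtered consistently.
        return X[mask], y[mask]

X = np.array([[0.1], [0.2], [5.0], [0.3]])
y = np.array([0, 1, 1, 0])
Xr, yr = OutlierResampler().fit_resample(X, y)
print(Xr.shape[0], len(yr))  # 3 3
```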
For many models, fit_resample need only return sample_weight. For sample compression approaches (e.g. that in BIRCH), this is not sufficient, as the representative centroids differ from the input samples. Hence I think fit_resample should return altered data directly, in the form of a dict with keys X, y, and sample_weight as required. (It still might be appropriate for many Resamplers to modify only sample_weight; if necessary, another Resampler can be chained that realises the weights as replicated or deleted entries in X and y.)
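A hedged sketch of both ideas together (all names and the dict convention are hypothetical): one Resampler attaches integer sample weights, and a second, chained Resampler realises those weights as replicated rows for estimators that do not accept sample_weight:

```python
import numpy as np

# Hypothetical resampler that only attaches weights, leaving X and y intact,
# returning the proposed dict of X, y, sample_weight.
class WeightingResampler:
    def fit_resample(self, X, y):
        weights = np.array([2, 1, 3])  # illustrative counts per sample
        return {"X": X, "y": y, "sample_weight": weights}

# Hypothetical chained resampler that realises integer weights as
# replicated entries in X and y.
class WeightRealiser:
    def fit_resample(self, X, y, sample_weight):
        return {"X": np.repeat(X, sample_weight, axis=0),
                "y": np.repeat(y, sample_weight)}

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([0, 1, 0])
out = WeightingResampler().fit_resample(X, y)
realised = WeightRealiser().fit_resample(out["X"], out["y"],
                                         out["sample_weight"])
print(realised["X"].shape[0])  # 6 rows: weights 2 + 1 + 3
```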