Description
Some data transformations -- including over/under-sampling (#1454), outlier removal, instance reduction, and other forms of dataset compression, like that used in BIRCH (#3802) -- entail altering a dataset at training time while leaving it unaltered at prediction time. (In some cases, such as outlier removal, it makes sense to reapply a fitted model to new data, while in others model reuse after fitting seems less applicable.)
As noted elsewhere, transformers that change the number of samples are not currently supported, particularly in the context of Pipelines, where a transformation is applied at both fit and predict time (although a hack might abuse fit_transform to make this not so). Pipelines of Transformers also cannot cope with changes in the number of samples at fit time for supervised problems, because Transformers return only a modified X, not a modified y.
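To illustrate the mismatch, here is a minimal sketch (the class name and threshold are hypothetical, not an existing scikit-learn API) of a transformer-style outlier remover. Because transform can only return X, the corresponding y is left at its original length, which would break any downstream supervised estimator in a Pipeline:

```python
import numpy as np

# Hypothetical transformer-style outlier remover: transform() can only
# return a modified X, so y keeps its original length.
class NaiveOutlierRemover:
    def fit(self, X, y=None):
        # Keep rows whose first feature is within a fixed threshold.
        self.mask_ = np.abs(X[:, 0]) < 3.0
        return self

    def transform(self, X):
        return X[self.mask_]

X = np.array([[0.1], [0.2], [5.0], [0.3]])
y = np.array([0, 1, 1, 0])

Xt = NaiveOutlierRemover().fit(X, y).transform(X)
# X shrinks to 3 rows but y still has 4 entries -- a supervised
# estimator fitted next in a Pipeline would see mismatched lengths.
print(Xt.shape[0], len(y))  # 3 4
```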
To handle this class of problems, I propose introducing a new category of estimator, called a Resampler. It must define at least a fit_resample method, which Pipeline will call at fit time; at other times the data passes through unchanged. (For this reason, a Resampler cannot also be a Transformer, or else we would need to define their precedence.)
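A minimal sketch of what such an estimator might look like (the class name and return convention here are illustrative assumptions, not a settled API); the key difference from a Transformer is that both X and y come back filtered together:

```python
import numpy as np

# Sketch of the proposed Resampler category: fit_resample is called by
# Pipeline at fit time only; at predict time data passes through unchanged.
class OutlierResampler:
    def fit_resample(self, X, y):
        mask = np.abs(X[:, 0]) < 3.0
        # Unlike transform(), both X and y are filtered consistently.
        return X[mask], y[mask]

X = np.array([[0.1], [0.2], [5.0], [0.3]])
y = np.array([0, 1, 1, 0])
Xr, yr = OutlierResampler().fit_resample(X, y)
print(Xr.shape[0], len(yr))  # 3 3
```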
For many models, fit_resample need only return sample_weight. For sample compression approaches (e.g. that in BIRCH), this is not sufficient, as the representative centroids differ from the input samples. Hence I think fit_resample should return altered data directly, in the form of a dict with keys X, y, and sample_weight as required. (It still might be appropriate for many Resamplers to modify only sample_weight; if necessary, another Resampler can be chained that realises the weights as replicated or deleted entries in X and y.)
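A hedged sketch of both ideas together (all names and the dict convention are hypothetical): one Resampler attaches integer sample weights, and a second, chained Resampler realises those weights as replicated rows for estimators that do not accept sample_weight:

```python
import numpy as np

# Hypothetical resampler that only attaches weights, leaving X and y intact,
# returning the proposed dict of X, y, sample_weight.
class WeightingResampler:
    def fit_resample(self, X, y):
        weights = np.array([2, 1, 3])  # illustrative counts per sample
        return {"X": X, "y": y, "sample_weight": weights}

# Hypothetical chained resampler that realises integer weights as
# replicated entries in X and y.
class WeightRealiser:
    def fit_resample(self, X, y, sample_weight):
        return {"X": np.repeat(X, sample_weight, axis=0),
                "y": np.repeat(y, sample_weight)}

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([0, 1, 0])
out = WeightingResampler().fit_resample(X, y)
realised = WeightRealiser().fit_resample(out["X"], out["y"],
                                         out["sample_weight"])
print(realised["X"].shape[0])  # 6 rows: weights 2 + 1 + 3
```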