Skip to content

Resampler estimators that change the sample size in fitting #3855

@jnothman

Description

@jnothman

Some data transformations -- including over/under-sampling (#1454), outlier removal, instance reduction, and other forms of dataset compression, like that used in BIRCH (#3802) -- entail altering a dataset at training time, but leaving it unaltered at prediction time. (In some cases, such as outlier removal, it makes sense to reapply a fitted model to new data, while in others model reuse after fitting seems less applicable. )

As noted elsewhere, transformers that change the number of samples are not currently supported, certainly in the context of Pipelines where a transformation is applied both at fit and predict time (although a hack might abuse fit_transform to make this not so). Pipelines of Transformers also would not cope with changes in the sample size at fit time for supervised problems because Transformers do not return a modified y, only X.

To handle this class of problems, I propose introducing a new category of estimator, called a Resampler. It must define at least a fit_resample method, which Pipeline will call at fit time, passing the data unchanged at other times. (For this reason, a Resampler cannot also be a Transformer, or else we need to define their precedence.)

For many models, fit_resample needs only return sample_weight. For sample compression approaches (e.g. that in BIRCH), this is not sufficient as the representative centroids are modified from the input samples. Hence I think fit_resample should return altered data directly, in the form of a dict with keys X, y, sample_weight as required. (It still might be appropriate for many Resamplers to only modify sample_weight; if necessary, another Resampler can be chained that realises the weights as replicated or deleted entries in X and y.)

Metadata

Metadata

Assignees

No one assigned

    Labels

    APIModerateAnything that requires some knowledge of conventions and best practicesNew Featurehelp wanted

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions