Handle np.nan / missing values in SplineTransformer #26793

@ogrisel


I think it would be quite natural to add an option to SplineTransformer to accept inputs with missing values as follows:

  • handle_missing="error": the default (keeps the current behavior),
  • handle_missing="zero"/"constant": encode missing values by setting all output features for that input column to 0 (or some other constant, see discussion below),
  • handle_missing="indicator": append an extra binary feature as a missingness indicator and encode missing values as 0 on the remaining output features.

Note that handle_missing="indicator" would differ from, and be statistically more meaningful than, chaining SimpleImputer(strategy="mean", add_indicator=True) with SplineTransformer. It would also make for leaner ML pipelines (better UX).

I am not sure we need to add the handle_missing="zero" option. It would break the property that the output values of a given SplineTransformer encoding always sum to 1, whereas handle_missing="indicator" would preserve this property (in addition to making missingness more explicit to the downstream model, in case missingness is informative one way or another).

If we want to preserve the sum-to-1 property without adding an explicit missingness indicator feature, maybe we could instead provide handle_missing="constant" (not sure about the name), which would encode missing values as 1 / n_outputs. I am not entirely sure this would result in a more interesting prior than the zero encoding.
