
sklearn.compose.ColumnTransformer does not keep the transformers' desired output dtype #24182

@idanmoradarthas

Description


Hi,

We are using ColumnTransformer as our unified preprocessor to transform the data. We have the following transformers:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, FunctionTransformer



mean_imputer_pipline = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean'))])
constan_one_imputer_pipline = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value=1.0))])
constan_zero_imputer_pipline = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value=0.0))])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', dtype=np.int8))])

ordinal_categories = [determine_mapping(ordinal_feature) for ordinal_feature in ordinal_features]

ordinal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ordinal', OrdinalEncoder(categories=ordinal_categories, dtype=np.int8))])

From them we create our ColumnTransformer object:

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num_mean', mean_imputer_pipline, mean_imputer_features),
        ('num_constant_one', constan_one_imputer_pipline, constant_one_imputer_features),
        ('num_constant_zero', constan_zero_imputer_pipline, constant_zero_imputer_features),
        ('cat', categorical_transformer, categorical_features),
        ('ordinal', ordinal_transformer, ordinal_features)],
    remainder="drop")

When we call the transform method, the output dtype we requested for categorical_transformer and ordinal_transformer (np.int8) is not kept. Instead, we receive a single NumPy ndarray whose dtype is np.float64.

We debugged the code and found that, to join the outputs of the different transformers, ColumnTransformer uses np.hstack, which creates a single ndarray with one dtype.
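The upcasting can be reproduced with NumPy alone. This is a minimal sketch, not taken from our real pipeline: the two small arrays stand in for the outputs of an int8 encoder and a float64 imputer, and their values are made up for illustration.

```python
import numpy as np

# Hypothetical stand-ins for two transformer outputs:
# an int8 block (e.g. from an encoder) and a float64 block (e.g. from an imputer).
encoded = np.array([[0], [1], [2]], dtype=np.int8)
imputed = np.array([[0.5], [1.5], [2.5]], dtype=np.float64)

# np.hstack returns one ndarray, so both blocks must share a single dtype.
# NumPy picks a common dtype that can represent every input.
stacked = np.hstack([encoded, imputed])

print(encoded.dtype, imputed.dtype, stacked.dtype)
```

The int8 request is honoured by each transformer individually; it is only lost when the blocks are stacked into one array.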

Is there a way to keep the per-transformer dtypes?
If not:

  1. Please add a warning to the ColumnTransformer documentation stating that the concatenated result does not keep the output dtype requested for each transformer.
  2. Please provide an option to receive the result without concatenation, as a list of ndarrays.
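To illustrate what option 2 could look like from the user's side, here is a rough sketch that applies each fitted transformer separately instead of calling transform on the ColumnTransformer. The helper name transform_separately is hypothetical and not part of scikit-learn; the sketch assumes X is a NumPy array and that columns are given as integer index lists. The toy data and transformer names are made up for the example.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder


def transform_separately(column_transformer, X):
    """Hypothetical helper: return {name: output} per fitted transformer,
    without stacking the outputs into one array."""
    outputs = {}
    for name, transformer, columns in column_transformer.transformers_:
        # Skip entries that are the strings 'drop' or 'passthrough'.
        if isinstance(transformer, str):
            continue
        # Assumes integer column indices and a NumPy array as input.
        outputs[name] = transformer.transform(X[:, columns])
    return outputs


# Toy data: column 0 is numeric with a missing value, column 1 is ordinal.
X = np.array([[1.0, 0.0], [np.nan, 1.0], [3.0, 2.0]])

preprocessor = ColumnTransformer(
    transformers=[
        ('num_mean', SimpleImputer(strategy='mean'), [0]),
        ('ordinal', OrdinalEncoder(dtype=np.int8), [1])],
    remainder="drop")
preprocessor.fit(X)

stacked = preprocessor.transform(X)           # single array, one dtype
parts = transform_separately(preprocessor, X)  # one array per transformer

print(stacked.dtype)
print({name: part.dtype for name, part in parts.items()})
```

With this approach the ordinal block keeps np.int8 while the imputed block stays np.float64, at the cost of handling several arrays downstream.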
