Closed
Description
Hi,
We are using ColumnTransformer as our unified preprocessor to transform the data. We have the following transformers:
```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, FunctionTransformer

# Numeric features: impute only.
mean_imputer_pipline = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean'))])
constan_one_imputer_pipline = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value=1.0))])
constan_zero_imputer_pipline = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value=0.0))])

# Categorical features: impute, then one-hot encode as int8.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', dtype=np.int8))])

# Ordinal features: impute, then ordinal encode as int8.
ordinal_categories = [determine_mapping(ordinal_feature) for ordinal_feature in ordinal_features]
ordinal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ordinal', OrdinalEncoder(categories=ordinal_categories, dtype=np.int8))])
```

From them we create our ColumnTransformer object:
```python
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num_mean', mean_imputer_pipline, mean_imputer_features),
        ('num_constant_one', constan_one_imputer_pipline, constant_one_imputer_features),
        ('num_constant_zero', constan_zero_imputer_pipline, constant_zero_imputer_features),
        ('cat', categorical_transformer, categorical_features),
        ('ordinal', ordinal_transformer, ordinal_features)],
    remainder="drop")
```

When we call the transform method, the np.int8 output dtype requested for categorical_transformer and ordinal_transformer is not kept. Instead we receive a single NumPy ndarray whose dtype is np.float64.
We debugged the code and found that, to join the outputs of the different transformers, ColumnTransformer uses np.hstack, which creates one unified ndarray with a single dtype.
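The dtype promotion can be seen with NumPy alone. The following is a minimal sketch, not taken from our real pipeline: the two small arrays are made-up stand-ins for an imputer output (float64) and an encoder output (int8).

```python
import numpy as np

# Made-up stand-ins for two transformer outputs with different dtypes.
imputed = np.array([[0.5], [1.5]], dtype=np.float64)   # e.g. an imputer block
encoded = np.array([[1, 0], [0, 1]], dtype=np.int8)    # e.g. an int8 encoder block

# np.hstack returns a single ndarray, so NumPy must pick one common dtype.
stacked = np.hstack([imputed, encoded])

print(stacked.shape)                          # (2, 3)
print(stacked.dtype)                          # float64
print(np.result_type(np.float64, np.int8))    # float64, the promoted dtype
```

Since a single ndarray can hold only one dtype, the int8 columns are promoted to float64 as soon as they share an array with a float64 block.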
Is there a solution?
If not:
- Please add a warning to the ColumnTransformer documentation that the concatenated result does not keep the output dtypes requested for the individual transformers.
- Give us an option to receive the result without concatenation, as a list of ndarrays.
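To illustrate the second request, here is a rough sketch of what "no concatenation" could look like when done by hand. `fit_transform_blocks` is a hypothetical helper, not a scikit-learn API, and the toy DataFrame, column names, and category order are assumptions made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder


def fit_transform_blocks(transformers, X):
    """Hypothetical helper (not part of scikit-learn).

    Fits each transformer on its own columns and returns the outputs as a
    list, without stacking them, so every block keeps its own dtype.
    """
    fitted, outputs = [], []
    for name, transformer, columns in transformers:
        est = clone(transformer)                 # leave the original unfitted
        outputs.append(est.fit_transform(X[columns]))
        fitted.append((name, est, columns))
    return fitted, outputs


# Toy data standing in for the real feature frame (assumed for illustration).
X = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "size": ["small", "large", "small"],
})

transformers = [
    ("num_mean", SimpleImputer(strategy="mean"), ["age"]),
    ("ordinal", OrdinalEncoder(categories=[["small", "large"]], dtype=np.int8), ["size"]),
]

fitted, outputs = fit_transform_blocks(transformers, X)
for (name, _, _), block in zip(fitted, outputs):
    print(name, block.dtype, block.shape)
```

The same (name, transformer, columns) triples that ColumnTransformer accepts are reused here, so the only difference from the setup above is that the blocks are returned separately. The fitted estimators are returned as well so that they can later be applied to new data with `transform`.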