Commit f819704

jorisvandenbossche authored and glemaitre committed
MAINT: Revert ChainedImputer (scikit-learn#11600)
1 parent 2242c59 commit f819704

File tree

7 files changed: +16 -883 lines changed


doc/modules/classes.rst

Lines changed: 1 addition & 2 deletions

@@ -655,9 +655,8 @@ Kernels:
    :template: class.rst

    impute.SimpleImputer
-   impute.ChainedImputer
    impute.MissingIndicator
-
+
 .. _kernel_approximation_ref:

 :mod:`sklearn.kernel_approximation` Kernel Approximation

doc/modules/impute.rst

Lines changed: 2 additions & 69 deletions

@@ -16,22 +16,6 @@ values. However, this comes at the price of losing data which may be valuable
 i.e., to infer them from the known part of the data. See the :ref:`glossary`
 entry on imputation.

-
-Univariate vs. Multivariate Imputation
-======================================
-
-One type of imputation algorithm is univariate, which imputes values in the i-th
-feature dimension using only non-missing values in that feature dimension
-(e.g. :class:`impute.SimpleImputer`). By contrast, multivariate imputation
-algorithms use the entire set of available feature dimensions to estimate the
-missing values (e.g. :class:`impute.ChainedImputer`).
-
-
-.. _single_imputer:
-
-Univariate feature imputation
-=============================
-
 The :class:`SimpleImputer` class provides basic strategies for imputing missing
 values. Missing values can be imputed with a provided constant value, or using
 the statistics (mean, median or most frequent) of each column in which the

@@ -87,60 +71,9 @@ string values or pandas categoricals when using the ``'most_frequent'`` or
     ['a' 'y']
     ['b' 'y']]

-.. _chained_imputer:
-
-
-Multivariate feature imputation
-===============================

-A more sophisticated approach is to use the :class:`ChainedImputer` class, which
-implements the imputation technique from MICE (Multivariate Imputation by
-Chained Equations). MICE models each feature with missing values as a function of
-other features, and uses that estimate for imputation. It does so in a round-robin
-fashion: at each step, a feature column is designated as output `y` and the other
-feature columns are treated as inputs `X`. A regressor is fit on `(X, y)` for known `y`.
-Then, the regressor is used to predict the unknown values of `y`. This is repeated
-for each feature in a chained fashion, and then is done for a number of imputation
-rounds. Here is an example snippet::
-
-    >>> import numpy as np
-    >>> from sklearn.impute import ChainedImputer
-    >>> imp = ChainedImputer(n_imputations=10, random_state=0)
-    >>> imp.fit([[1, 2], [np.nan, 3], [7, np.nan]])
-    ChainedImputer(imputation_order='ascending', initial_strategy='mean',
-            max_value=None, min_value=None, missing_values=nan, n_burn_in=10,
-            n_imputations=10, n_nearest_features=None, predictor=None,
-            random_state=0, verbose=False)
-    >>> X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
-    >>> print(np.round(imp.transform(X_test)))
-    [[ 1.  2.]
-     [ 6.  4.]
-     [13.  6.]]
-
-Both :class:`SimpleImputer` and :class:`ChainedImputer` can be used in a Pipeline
-as a way to build a composite estimator that supports imputation.
-See :ref:`sphx_glr_auto_examples_plot_missing_values.py`.
-
-.. _multiple_imputation:
-
-Multiple vs. Single Imputation
-==============================
-
-In the statistics community, it is common practice to perform multiple imputations,
-generating, for example, 10 separate imputations for a single feature matrix.
-Each of these 10 imputations is then put through the subsequent analysis pipeline
-(e.g. feature engineering, clustering, regression, classification). The 10 final
-analysis results (e.g. held-out validation error) allow the data scientist to
-obtain understanding of the uncertainty inherent in the missing values. The above
-practice is called multiple imputation. As implemented, the :class:`ChainedImputer`
-class generates a single (averaged) imputation for each missing value because this
-is the most common use case for machine learning applications. However, it can also be used
-for multiple imputations by applying it repeatedly to the same dataset with different
-random seeds with the ``n_imputations`` parameter set to 1.
-
-Note that a call to the ``transform`` method of :class:`ChainedImputer` is not
-allowed to change the number of samples. Therefore multiple imputations cannot be
-achieved by a single call to ``transform``.
+:class:`SimpleImputer` can be used in a Pipeline as a way to build a composite
+estimator that supports imputation. See :ref:`sphx_glr_auto_examples_plot_missing_values.py`.

 .. _missing_indicator:
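The retained documentation line says that :class:`SimpleImputer` can be used in a Pipeline as a composite estimator. A minimal sketch of that pattern (not part of this commit; the toy data and the choice of `LinearRegression` are illustrative assumptions):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy data with missing entries encoded as np.nan
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# Mean imputation and regression combined into one composite estimator:
# fit() imputes the training data before fitting the regressor, and
# predict() applies the same learned column means to new data.
pipe = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
pipe.fit(X, y)
pred = pipe.predict([[np.nan, 4.0]])  # missing value imputed inside the pipeline
print(pred.shape)  # (1,)
```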

doc/whats_new/v0.20.rst

Lines changed: 0 additions & 5 deletions

@@ -150,11 +150,6 @@ Preprocessing
 - Added :class:`MissingIndicator` which generates a binary indicator for
   missing values. :issue:`8075` by :user:`Maniteja Nandana <maniteja123>` and
   :user:`Guillaume Lemaitre <glemaitre>`.
-
-- Added :class:`impute.ChainedImputer`, which is a strategy for imputing missing
-  values by modeling each feature with missing values as a function of
-  other features in a round-robin fashion. :issue:`8478` by
-  :user:`Sergey Feldman <sergeyf>`.

 - :class:`linear_model.SGDClassifier`, :class:`linear_model.SGDRegressor`,
   :class:`linear_model.PassiveAggressiveClassifier`,
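The reverted changelog entry describes modeling each feature with missing values as a function of the other features in a round-robin fashion. That scheme can be sketched by hand with plain NumPy and a regressor; this is an illustrative approximation of the idea, not the reverted implementation (the number of rounds, the mean initialization, and `LinearRegression` are all assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
X[rng.rand(100, 3) < 0.2] = np.nan  # knock out ~20% of entries

mask = np.isnan(X)
# Initial fill: replace each missing entry with its column mean
filled = np.where(mask, np.nanmean(X, axis=0), X)

for _ in range(5):  # a few imputation rounds
    for j in range(X.shape[1]):  # round-robin over feature columns
        if not mask[:, j].any():
            continue
        # Treat column j as the output, the remaining columns as inputs
        other = np.delete(filled, j, axis=1)
        reg = LinearRegression().fit(other[~mask[:, j]], filled[~mask[:, j], j])
        # Re-impute only the originally missing entries of column j
        filled[mask[:, j], j] = reg.predict(other[mask[:, j]])

print(np.isnan(filled).any())  # False: every missing entry has been imputed
```

Observed entries are never overwritten; only the originally missing positions are refined on each pass.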

examples/plot_missing_values.py

Lines changed: 9 additions & 19 deletions

@@ -3,30 +3,29 @@
 Imputing missing values before building an estimator
 ====================================================

+This example shows that imputing the missing values can give better
+results than discarding the samples containing any missing value.
+Imputing does not always improve the predictions, so please check via
+cross-validation. Sometimes dropping rows or using marker values is
+more effective.
+
 Missing values can be replaced by the mean, the median or the most frequent
 value using the basic :func:`sklearn.impute.SimpleImputer`.
 The median is a more robust estimator for data with high magnitude variables
 which could dominate results (otherwise known as a 'long tail').

-Another option is the :func:`sklearn.impute.ChainedImputer`. This uses
-round-robin linear regression, treating every variable as an output in
-turn. The version implemented assumes Gaussian (output) variables. If your
-features are obviously non-Normal, consider transforming them to look more
-Normal so as to improve performance.
-
 In addition of using an imputing method, we can also keep an indication of the
 missing information using :func:`sklearn.impute.MissingIndicator` which might
 carry some information.
 """
-
 import numpy as np
 import matplotlib.pyplot as plt

 from sklearn.datasets import load_diabetes
 from sklearn.datasets import load_boston
 from sklearn.ensemble import RandomForestRegressor
 from sklearn.pipeline import make_pipeline, make_union
-from sklearn.impute import SimpleImputer, ChainedImputer, MissingIndicator
+from sklearn.impute import SimpleImputer, MissingIndicator
 from sklearn.model_selection import cross_val_score

 rng = np.random.RandomState(0)

@@ -71,18 +70,10 @@ def get_results(dataset):
     mean_impute_scores = cross_val_score(estimator, X_missing, y_missing,
                                          scoring='neg_mean_squared_error')

-    # Estimate the score after chained imputation of the missing values
-    estimator = make_pipeline(
-        make_union(ChainedImputer(missing_values=0, random_state=0),
-                   MissingIndicator(missing_values=0)),
-        RandomForestRegressor(random_state=0, n_estimators=100))
-    chained_impute_scores = cross_val_score(estimator, X_missing, y_missing,
-                                            scoring='neg_mean_squared_error')

     return ((full_scores.mean(), full_scores.std()),
             (zero_impute_scores.mean(), zero_impute_scores.std()),
-            (mean_impute_scores.mean(), mean_impute_scores.std()),
-            (chained_impute_scores.mean(), chained_impute_scores.std()))
+            (mean_impute_scores.mean(), mean_impute_scores.std()))


 results_diabetes = np.array(get_results(load_diabetes()))

@@ -98,8 +89,7 @@ def get_results(dataset):

 x_labels = ['Full data',
             'Zero imputation',
-            'Mean Imputation',
-            'Chained Imputation']
+            'Mean Imputation']
 colors = ['r', 'g', 'b', 'orange']

 # plot diabetes results
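The removed branch of this example combined an imputer with :class:`MissingIndicator` via `make_union`, so that the downstream model sees both the imputed values and a binary missingness mask. The same pattern still works with :class:`SimpleImputer`; a small sketch (toy data and parameter choices are illustrative, not from the commit):

```python
import numpy as np
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.pipeline import make_union

# As in the example script, 0 is used as the missing-value marker
X = np.array([[0.0, 2.0],
              [6.0, 0.0],
              [3.0, 4.0]])

# FeatureUnion concatenates the imputed columns with binary indicator
# columns flagging which entries were originally missing.
union = make_union(SimpleImputer(missing_values=0, strategy="mean"),
                   MissingIndicator(missing_values=0))
Xt = union.fit_transform(X)
print(Xt.shape)  # (3, 4): 2 imputed features + 2 indicator columns
```

Feeding `Xt` to a regressor lets the model exploit the missingness pattern itself, which is the motivation the example's docstring gives for keeping an indication of the missing information.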

0 commit comments
