@@ -16,22 +16,6 @@ values. However, this comes at the price of losing data which may be valuable
 i.e., to infer them from the known part of the data. See the :ref:`glossary`
 entry on imputation.
 
-
-Univariate vs. Multivariate Imputation
-======================================
-
-One type of imputation algorithm is univariate, which imputes values in the i-th
-feature dimension using only non-missing values in that feature dimension
-(e.g. :class:`impute.SimpleImputer`). By contrast, multivariate imputation
-algorithms use the entire set of available feature dimensions to estimate the
-missing values (e.g. :class:`impute.ChainedImputer`).
-
-
-.. _single_imputer:
-
-Univariate feature imputation
-=============================
-
 The :class:`SimpleImputer` class provides basic strategies for imputing missing
 values. Missing values can be imputed with a provided constant value, or using
 the statistics (mean, median or most frequent) of each column in which the
@@ -87,60 +71,9 @@ string values or pandas categoricals when using the ``'most_frequent'`` or
      ['a' 'y']
      ['b' 'y']]
 
-.. _chained_imputer:
-
-
-Multivariate feature imputation
-===============================
 
-A more sophisticated approach is to use the :class:`ChainedImputer` class, which
-implements the imputation technique from MICE (Multivariate Imputation by
-Chained Equations). MICE models each feature with missing values as a function of
-other features, and uses that estimate for imputation. It does so in a round-robin
-fashion: at each step, a feature column is designated as output `y` and the other
-feature columns are treated as inputs `X`. A regressor is fit on `(X, y)` for known `y`.
-Then, the regressor is used to predict the unknown values of `y`. This is repeated
-for each feature in a chained fashion, and then is done for a number of imputation
-rounds. Here is an example snippet::
-
-    >>> import numpy as np
-    >>> from sklearn.impute import ChainedImputer
-    >>> imp = ChainedImputer(n_imputations=10, random_state=0)
-    >>> imp.fit([[1, 2], [np.nan, 3], [7, np.nan]])
-    ChainedImputer(imputation_order='ascending', initial_strategy='mean',
-            max_value=None, min_value=None, missing_values=nan, n_burn_in=10,
-            n_imputations=10, n_nearest_features=None, predictor=None,
-            random_state=0, verbose=False)
-    >>> X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
-    >>> print(np.round(imp.transform(X_test)))
-    [[ 1.  2.]
-     [ 6.  4.]
-     [13.  6.]]
-
-Both :class:`SimpleImputer` and :class:`ChainedImputer` can be used in a Pipeline
-as a way to build a composite estimator that supports imputation.
-See :ref:`sphx_glr_auto_examples_plot_missing_values.py`.
-
-.. _multiple_imputation:
-
-Multiple vs. Single Imputation
-==============================
-
-In the statistics community, it is common practice to perform multiple imputations,
-generating, for example, 10 separate imputations for a single feature matrix.
-Each of these 10 imputations is then put through the subsequent analysis pipeline
-(e.g. feature engineering, clustering, regression, classification). The 10 final
-analysis results (e.g. held-out validation error) allow the data scientist to
-obtain understanding of the uncertainty inherent in the missing values. The above
-practice is called multiple imputation. As implemented, the :class:`ChainedImputer`
-class generates a single (averaged) imputation for each missing value because this
-is the most common use case for machine learning applications. However, it can also be used
-for multiple imputations by applying it repeatedly to the same dataset with different
-random seeds with the ``n_imputations`` parameter set to 1.
-
-Note that a call to the ``transform`` method of :class:`ChainedImputer` is not
-allowed to change the number of samples. Therefore multiple imputations cannot be
-achieved by a single call to ``transform``.
+:class:`SimpleImputer` can be used in a Pipeline as a way to build a composite
+estimator that supports imputation. See :ref:`sphx_glr_auto_examples_plot_missing_values.py`.
 
 .. _missing_indicator:
 
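The :class:`SimpleImputer` behaviour that this diff retains can be exercised as follows. This is a minimal sketch using the mean strategy; the data values here are illustrative and not taken from the documentation:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Fit learns the per-column statistics on the non-missing entries:
# column 0 mean = (1 + 7) / 2 = 4.0, column 1 mean = (2 + 3 + 6) / 3 ≈ 3.67
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

# Transform replaces each NaN with its column's learned mean
X_test = [[np.nan, 2], [6, np.nan], [7, 6]]
X_imputed = imp.transform(X_test)
print(X_imputed)
```

Because the statistics are learned at ``fit`` time, the same column means are reused for any matrix passed to ``transform``, which is what makes the imputer usable as a Pipeline step.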